Friday, March 11, 2016

Randomly Browse through the Internet with C#

In this post I want to show how to surf the Internet with C# by following random links. This is exciting in itself (it is interesting to see where one ends up after a few links), and it also has practical applications: for example, Google's PageRank algorithm for ranking websites is based on a similar model of a random surfer.

The following C# program contains a WebBrowser control and a button. When the user clicks the button, the program searches the current page for a random link and displays the target page in the WebBrowser control.
The code should be relatively self-explanatory: the class Browser handles the search for new links. It stores the URL of the current page as well as its source code. When the method GoNext() is called, it calls FindLink() to find a random link. To do so, a random starting position in the source code of the current page (the source code is obtained via a WebClient) is chosen, and the first link after that position is taken. I find this method more efficient than scanning the whole document first and then choosing a random link. But watch out: this way certain links are preferred, since we no longer sample the links directly! Instead we sample a position in the source string uniformly, so each link is chosen with a probability proportional to the amount of text between it and the preceding link. Since the links are probably not evenly spaced in the source code (for example, there is often a big header at the beginning), our selection has a certain bias.
Once a link is found, we use the method from the previous post to convert relative links to absolute ones, if necessary, and follow it.
This program works but is still relatively basic: for example, it can run into dead ends, and, as noted above, the link selection is not truly uniform. I post it here in this form because I think that any real application will customize it anyway, and applications differ heavily.
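
If the bias matters for your application, one alternative is to collect all links first and then pick one with equal probability. The following is only a minimal sketch of that idea, not part of the program below: the class UniformLinkPicker and its methods are hypothetical names, and it uses the same simple search for a href=" as FindLink(), so it misses links written in any other form.

```csharp
using System;
using System.Collections.Generic;

public static class UniformLinkPicker
{
    // One shared generator for all calls.
    private static readonly Random Rnd = new Random();

    // Collects the target of every occurrence of a href="..." in the given HTML source.
    public static List<string> ExtractLinks(string html)
    {
        const string Marker = "a href=\"";
        var links = new List<string>();
        int pos = html.IndexOf(Marker, StringComparison.Ordinal);
        while (pos != -1)
        {
            int start = pos + Marker.Length;
            int end = html.IndexOf('"', start);
            if (end == -1) break; // unterminated attribute, stop scanning
            links.Add(html.Substring(start, end - start));
            pos = html.IndexOf(Marker, end, StringComparison.Ordinal);
        }
        return links;
    }

    // Returns one link chosen uniformly at random, or null if the page has no links.
    public static string PickUniform(string html)
    {
        List<string> links = ExtractLinks(html);
        return links.Count == 0 ? null : links[Rnd.Next(links.Count)];
    }
}
```

The price is one full pass over the page source per step. The returned link may still be relative and has to be resolved against the current page as before.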

So have fun trying this out and leave me interesting link chains in the comments!

The code:

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Windows.Forms;
using System.Net;

namespace RandomSurfer
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        Browser B;

        private void Form1_Load(object sender, EventArgs e)
        {
            // start with some arbitrary page, here a random article on Wikipedia
            string StartPage = "https://de.wikipedia.org/wiki/Spezial:Zuf%C3%A4llige_Seite";
            B = new Browser(StartPage);
            webBrowser1.Navigate(StartPage);
        }

        private void button1_Click(object sender, EventArgs e)
        {
            webBrowser1.Navigate(B.GoNext());
        }
    }

    public class Browser
    {
        string CurrentPage; // stores the URL of the current page
        string CurrentContent; // stores the source code of the current page

        public Browser(string Start)
        {
            CurrentPage = Start;
            CurrentContent = GetText(CurrentPage);
        }

        private string GetText(string url)
        {
            // use a WebClient to download the source code of a website
            // (the using block disposes the client when we are done)
            using (WebClient Webclient1 = new WebClient())
            {
                Webclient1.Encoding = System.Text.Encoding.UTF8;
                return Webclient1.DownloadString(url);
            }
        }

        public string GoNext()
        {
            // randomly go to a new website
            CurrentPage = FindLink(); // for this find a random link
            CurrentContent = GetText(CurrentPage); // and for the next search store the source code of the new page
            return CurrentPage;
        }

        private string FindLink()
        {
            // find a random link
            int Length = CurrentContent.Length;
            Random Rnd = new Random();
            int RndStart;
            int LinkStart = -1;
            int LinkEnd;
            string Link = "";

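            // a page without any links is a dead end: stay on the current page
            // (without this check the loop below would never terminate)
            if (CurrentContent.IndexOf("a href=\"") == -1)
                return CurrentPage;

            // select a random starting point in the source code and find the first link after that
            // repeat if none is found
            while (LinkStart == -1)
            {
                RndStart = Rnd.Next(Length);
                LinkStart = CurrentContent.IndexOf("a href=\"", RndStart);
            }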

            // extract the link
            LinkEnd = CurrentContent.IndexOf("\"", LinkStart + 8);
            Link = CurrentContent.Substring(LinkStart + 8, LinkEnd - LinkStart - 8);

            // resolve its global url
            System.Uri Base = new System.Uri(CurrentPage);
            System.Uri ResolvedAbsoluteURL = new System.Uri(Base, Link);
            return ResolvedAbsoluteURL.ToString();
        }
    }
}
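
One possible way to deal with the dead ends mentioned above is to remember the visited pages and step back when a page offers no link. The following is only a sketch under assumptions: BacktrackingSurfer is a hypothetical class, and the function passed to its constructor is assumed to return a random absolute link for a given page URL, or null if that page has none (it could, for example, download the page and search it like FindLink() does).

```csharp
using System;
using System.Collections.Generic;

public class BacktrackingSurfer
{
    // URLs of the pages visited so far, the current page on top.
    private readonly Stack<string> history = new Stack<string>();

    // Hypothetical helper: returns a random absolute link found on the page, or null if there is none.
    private readonly Func<string, string> pickLink;

    public BacktrackingSurfer(string start, Func<string, string> pickLink)
    {
        history.Push(start);
        this.pickLink = pickLink;
    }

    public string GoNext()
    {
        while (true)
        {
            string next = pickLink(history.Peek());
            if (next != null)
            {
                history.Push(next);
                return next;
            }
            // Dead end: if this is the start page there is nowhere to go back to.
            if (history.Count == 1)
                return history.Peek();
            // Otherwise go back one page and try again from there.
            history.Pop();
        }
    }
}
```

Every pass through the loop either returns or removes one page from the history, so GoNext() always terminates. It may lead to the same dead end again on a later call, since the link choice stays random.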

Monday, March 7, 2016

Resolve Relative URL with C#

Recently I wanted to scan a web page for links and then follow them with C#. Of course relative links turned up as well, and I found that it is not that easy to find a way on the Internet to resolve them with C#. At first I thought about writing my own function, but then I found a very easy method for this, which I want to share here.
Let us begin with a few words about links on the Internet: in HTML one can refer to another page, i.e. create a link, as follows:

<a href="http://www.sudokusoftheday.blogspot.de/">Linktext</a>

In the link above I gave an absolute URL as the target, which can be recognized by the leading scheme http://
But I can also refer to pages relative to my current page, for example:

<a href="../../p/youtube-channel.html">Linktext</a>

This link points to the page http://csharp-tricks-en.blogspot.de/p/youtube-channel.html . Since this post is located in the virtual folder "/2016/03/", "../../" moves two folders up to the root URL, and from there the page /p/youtube-channel.html is called.

Absolute links are no problem; we can simply follow them. But if we browse a page and want to follow a relative link, things get a bit more complicated, especially since all the usual path expressions are allowed, like the "../" seen above.

To avoid assembling these paths by hand, one can use the class System.Uri in the form System.Uri ResolvedAbsoluteURL = new System.Uri(BaseUri, "relative path");
As an example, let us consider the Wikipedia article about the "Pronghorn". It contains a relative link to the Wikipedia article about "Deer":

<a href="/wiki/Deer" title="Deer">deer</a>

To get the valid absolute link we execute the following code:
System.Uri Base = new System.Uri("https://en.wikipedia.org/wiki/Pronghorn");
System.Uri ResolvedAbsoluteURL = new System.Uri(Base, "/wiki/Deer");
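
To try this out, the two lines can be wrapped in a small console program. The following is a minimal sketch: the class ResolveDemo and the method Resolve are hypothetical names, and the file name example-post.html in the second call is made up to stand for a post in the folder "/2016/03/".

```csharp
using System;

public static class ResolveDemo
{
    // Resolves a link found on the page at baseUrl to an absolute URL.
    public static string Resolve(string baseUrl, string link)
    {
        Uri Base = new Uri(baseUrl);
        Uri ResolvedAbsoluteURL = new Uri(Base, link);
        return ResolvedAbsoluteURL.AbsoluteUri;
    }

    public static void Main()
    {
        // The Wikipedia example from above.
        // prints https://en.wikipedia.org/wiki/Deer
        Console.WriteLine(Resolve("https://en.wikipedia.org/wiki/Pronghorn", "/wiki/Deer"));

        // The "../../" example; example-post.html is a made-up file name.
        // prints http://csharp-tricks-en.blogspot.de/p/youtube-channel.html
        Console.WriteLine(Resolve("http://csharp-tricks-en.blogspot.de/2016/03/example-post.html",
                                  "../../p/youtube-channel.html"));

        // A link that is already absolute is returned unchanged.
        // prints http://www.sudokusoftheday.blogspot.de/
        Console.WriteLine(Resolve("https://en.wikipedia.org/wiki/Pronghorn",
                                  "http://www.sudokusoftheday.blogspot.de/"));
    }
}
```

Because an absolute link passes through unchanged, the same call can be used for every link found on a page without checking first whether it is relative.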