Friday, March 11, 2016

Randomly Browse through the Internet with C#

In this post I want to show how to surf the Internet from C# by following random links. This is exciting in itself (it is interesting to see where one ends up after a couple of links), and it also has practical applications: for example, Google's PageRank algorithm for rating the popularity of websites is based on a similar "random surfer" model.

The following C# program contains a WebBrowser control and a button. When the user clicks the button, the program searches the current website for a random link and displays the target website in the WebBrowser control.
The code should be relatively self-explanatory: the class Browser handles the search for new links. It stores the URL of the current page as well as its source code. A call to GoNext() calls FindLink() to find a random link. To do so, FindLink() picks a random starting position in the source code of the current page (the source code is obtained via a WebClient) and takes the first link after that position. I find this more efficient than scanning the whole document first and then choosing a random link. But watch out: this way certain links are favored, because we no longer choose uniformly among the links. Instead we choose uniformly among positions in the source string, so a link is selected with a probability proportional to the amount of text between it and the preceding link. Since links are probably not evenly spread over the source code (for example, there is a big header at the beginning), the selection has a certain bias.
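For comparison, here is a minimal sketch of the unbiased alternative mentioned above: scan the whole document, collect every link, and then pick one with equal probability. The names LinkPicker, ExtractLinks and PickUniform are made up for this illustration and are not part of the program below; the sketch also assumes links always appear in the exact form a href="...".

```csharp
using System;
using System.Collections.Generic;

public static class LinkPicker
{
    private const string Marker = "a href=\"";

    // Collect the target of every  a href="..."  occurrence in the page source.
    public static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        int pos = html.IndexOf(Marker, StringComparison.Ordinal);
        while (pos != -1)
        {
            int start = pos + Marker.Length;
            int end = html.IndexOf('"', start);
            if (end == -1) break; // unterminated attribute: stop scanning
            links.Add(html.Substring(start, end - start));
            pos = html.IndexOf(Marker, end, StringComparison.Ordinal);
        }
        return links;
    }

    // Pick one link with equal probability; returns null if the page has none.
    public static string PickUniform(string html, Random rnd)
    {
        List<string> links = ExtractLinks(html);
        return links.Count == 0 ? null : links[rnd.Next(links.Count)];
    }
}

public static class Program
{
    public static void Main()
    {
        // "/far" follows a long run of text, so the position-based method
        // would pick it almost every time; here both links are equally likely
        string html = "<a href=\"/near\">x</a>" + new string(' ', 1000) + "<a href=\"/far\">y</a>";
        Console.WriteLine(string.Join(", ", LinkPicker.ExtractLinks(html))); // /near, /far
        Console.WriteLine(LinkPicker.PickUniform(html, new Random()));
    }
}
```

The price for the uniform choice is one full pass over the page source per click, which is the cost the position-based method avoids.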
Once a link has been found, we use the method from the previous post to convert relative links to absolute ones, if necessary, and follow it.
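As a quick illustration of that conversion, the following sketch uses the System.Uri constructor that takes a base URI and a relative string, which is also what the program below relies on. The example.com and example.org addresses are placeholders chosen for this example.

```csharp
using System;

public static class Program
{
    public static void Main()
    {
        // hypothetical page URL, used only for illustration
        var page = new Uri("https://example.com/wiki/Start");

        // a path-relative link replaces the last path segment
        Console.WriteLine(new Uri(page, "Other").AbsoluteUri);                 // https://example.com/wiki/Other
        // a root-relative link replaces the whole path
        Console.WriteLine(new Uri(page, "/about").AbsoluteUri);                // https://example.com/about
        // an absolute link is returned unchanged
        Console.WriteLine(new Uri(page, "https://example.org/x").AbsoluteUri); // https://example.org/x
    }
}
```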
This program works but is still relatively basic: for example, it can run into dead ends, and, as noted above, the link selection is not perfectly uniform. I post it in this form because I think that any real application will customize it anyway, and the applications differ widely.
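One kind of dead end is a link that a web client cannot download at all, such as a mailto: address. A possible customization, sketched here under that assumption, is to accept only links that resolve to an http or https address. LinkFilter and IsFollowable are hypothetical names and do not appear in the program below.

```csharp
using System;

public static class LinkFilter
{
    // Returns true only if href resolves (relative to pageUrl) to an http(s) address.
    public static bool IsFollowable(Uri pageUrl, string href)
    {
        Uri target;
        if (!Uri.TryCreate(pageUrl, href, out target))
        {
            return false; // not a valid URI at all
        }
        return target.Scheme == Uri.UriSchemeHttp || target.Scheme == Uri.UriSchemeHttps;
    }
}

public static class Program
{
    public static void Main()
    {
        // hypothetical page URL, used only for illustration
        var page = new Uri("https://example.com/wiki/Start");
        Console.WriteLine(LinkFilter.IsFollowable(page, "/about"));                     // True
        Console.WriteLine(LinkFilter.IsFollowable(page, "mailto:someone@example.com")); // False
    }
}
```

A link finder could call such a check and simply draw again when it returns false. Pages that fail to load or that only link back to themselves would still need separate handling.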

So have fun trying this out and leave me interesting link chains in the comments!

The code:

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Windows.Forms;
using System.Net;

namespace RandomSurfer
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        Browser B;

        private void Form1_Load(object sender, EventArgs e)
        {
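            // start with some arbitrary page, here a random article on the German Wikipedia
            // note: this URL redirects to a different article on every request, so the page
            // displayed first is not the one Browser downloaded; both agree after the first click
            string StartPage = "https://de.wikipedia.org/wiki/Spezial:Zuf%C3%A4llige_Seite";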
            B = new Browser(StartPage);
            webBrowser1.Navigate(StartPage);
        }

        private void button1_Click(object sender, EventArgs e)
        {
            webBrowser1.Navigate(B.GoNext());
        }
    }

    public class Browser
    {
        string CurrentPage; // stores the URL of the current website
        string CurrentContent; // stores the source code of the current website

        public Browser(string Start)
        {
            CurrentPage = Start;
            CurrentContent = GetText(CurrentPage);
        }

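        private string GetText(string url)
        {
            // use a WebClient to download the source code of a website;
            // the using block releases the client's resources afterwards
            using (WebClient Webclient1 = new WebClient())
            {
                Webclient1.Encoding = System.Text.Encoding.UTF8;
                // some servers reject requests that carry no User-Agent header;
                // the value below is an arbitrary placeholder
                Webclient1.Headers.Add("user-agent", "RandomSurfer/1.0");
                return Webclient1.DownloadString(url);
            }
        }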

        public string GoNext()
        {
            // randomly go to a new website
            CurrentPage = FindLink(); // for this find a random link
            CurrentContent = GetText(CurrentPage); // and for the next search store the source code of the new webpage
            return CurrentPage;
        }

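        // one shared generator; creating a new Random on every call can repeat values
        private static readonly Random Rnd = new Random();

        private string FindLink()
        {
            // find a random link
            const string Marker = "a href=\"";

            // a page without any link is a dead end: stay on the current page
            // instead of looping forever below
            if (CurrentContent.IndexOf(Marker, StringComparison.Ordinal) == -1)
            {
                return CurrentPage;
            }

            // select a random starting point in the source code and find the first link after that
            // repeat if none is found
            int LinkStart = -1;
            while (LinkStart == -1)
            {
                int RndStart = Rnd.Next(CurrentContent.Length);
                LinkStart = CurrentContent.IndexOf(Marker, RndStart, StringComparison.Ordinal);
            }

            // extract the link (the text between the quotes)
            int UrlStart = LinkStart + Marker.Length;
            int LinkEnd = CurrentContent.IndexOf("\"", UrlStart, StringComparison.Ordinal);
            if (LinkEnd == -1)
            {
                return CurrentPage; // malformed markup without a closing quote
            }
            // decode HTML entities such as &amp; that may occur inside the attribute
            string Link = WebUtility.HtmlDecode(CurrentContent.Substring(UrlStart, LinkEnd - UrlStart));

            // resolve its absolute URL relative to the current page
            System.Uri Base = new System.Uri(CurrentPage);
            System.Uri ResolvedAbsoluteURL = new System.Uri(Base, Link);
            return ResolvedAbsoluteURL.ToString();
        }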
    }
}
