Introduction to Web Scraping with HttpWebRequest using ASP.NET MVC 3

Before we begin, let me quickly highlight the points that I will be covering in this article:

1. What is Web Scraping?

2. Difference between Web Crawling and Web Scraping

3. Web Scraping using ASP.NET MVC

4. Summary / Further Reading.

So, let's get started…

1. What is Web Scraping? (Wiki)

Web scraping is a technique for extracting information from a website. It is done by writing programs that process the HTML pages of the target website and extract information from them.

2. Difference between Web Crawling and Web Scraping

“Crawling” refers to automatically retrieving web pages and following links to find still more web pages.

“Scraping” means parsing those pages to extract pieces of information in a structured way. It also refers to creating a programmatic interface, an API, that interacts with a site through an HTML interface meant for humans.

In short, “Crawling implies indexing, whereas scraping implies copying the content.”
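To make the distinction concrete, here is a minimal sketch rather than a real crawler. It works on a hard-coded HTML string instead of a live page, and the class name, method names and sample markup are all made up for illustration. FindLinks does the crawling half (collecting URLs that could be visited next) and ExtractTitle does the scraping half (pulling one specific value out of the page). Regular expressions are used only to keep the sample free of dependencies; a proper HTML parser is more robust on real pages.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class CrawlVsScrapeSketch
{
    // Made-up sample page; a real program would download this.
    public const string SampleHtml =
        "<html><head><title>My Blog</title></head><body>" +
        "<a href=\"/about\">About</a> <a href=\"/contact\">Contact</a>" +
        "</body></html>";

    // Crawling: discover the links that could be followed next.
    public static List<string> FindLinks(string html)
    {
        var links = new List<string>();
        var matches = Regex.Matches(html, "<a\\s+[^>]*href=\"([^\"]*)\"", RegexOptions.IgnoreCase);
        foreach (Match m in matches)
        {
            links.Add(m.Groups[1].Value);
        }
        return links;
    }

    // Scraping: extract one specific piece of information from the page.
    public static string ExtractTitle(string html)
    {
        Match m = Regex.Match(html, "<title>(.*?)</title>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value.Trim() : string.Empty;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", FindLinks(SampleHtml))); // the two hrefs
        Console.WriteLine(ExtractTitle(SampleHtml));                 // the page title
    }
}
```

A crawler would feed the links it finds back into its download queue, while a scraper stops at the extracted value and stores it.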

3. Web Scraping using ASP.NET MVC

Below is the code I wrote to scrape data from my WordPress website. I have added comments wherever applicable to make it easier to read.

Below are the constants that you need to define. ‘UserName’ and ‘Pwd’ are the login details for my WordPress account, ‘Url’ is the login page URL, and ‘ProfileUrl’ is the address of the page where the profile details are shown.

const string Url = "http://yassershaikh.com/wp-login.php";  
const string UserName = "guest";  
const string Pwd = ".netrocks!!"; // and this is not my real pwd :P
const string ProfileUrl = "http://yassershaikh.com/wp-admin/profile.php";  


public ActionResult Index()  
{  
    string postData = Crawler.PreparePostData(UserName, Pwd, Url);  
    byte[] data = Crawler.GetEncodedData(postData);

    string cookieValue = Crawler.GetCookie(Url, data);

    var model = Crawler.GetUserProfile(ProfileUrl, cookieValue);

    return View(model);  
}  

I created a static class called “Crawler”. Here’s the code for it.

// preparing post data
// each value is URL-encoded (Uri.EscapeDataString, needs "using System;")
// so that characters such as '&', ':' or '/' do not break the form body
public static string PreparePostData(string userName, string pwd, string url)
{
    var postData = new StringBuilder();
    postData.Append("log=" + Uri.EscapeDataString(userName));
    postData.Append("&");
    postData.Append("pwd=" + Uri.EscapeDataString(pwd));
    postData.Append("&");
    postData.Append("wp-submit=Log+In");
    postData.Append("&");
    postData.Append("redirect_to=" + Uri.EscapeDataString(url));
    postData.Append("&");
    postData.Append("testcookie=1");

    return postData.ToString();
}

public static byte[] GetEncodedData(string postData)  
{  
    var encoding = new ASCIIEncoding();  
    byte[] data = encoding.GetBytes(postData);  
    return data;  
}

public static string GetCookie(string url, byte[] data)
{
    var webRequest = (HttpWebRequest)WebRequest.Create(url);
    webRequest.Method = "POST";
    webRequest.ContentType = "application/x-www-form-urlencoded";
    webRequest.ContentLength = data.Length;
    webRequest.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2) Gecko/20100101 Firefox/10.0.2";
    // do not follow the redirect after login, we want the Set-Cookie header of the login response itself
    webRequest.AllowAutoRedirect = false;

    // write the form data to the request body and release the stream
    using (Stream requestStream = webRequest.GetRequestStream())
    {
        requestStream.Write(data, 0, data.Length);
    }

    string cookievalue = string.Empty;

    // dispose the response once the header has been read
    using (var webResponse = (HttpWebResponse)webRequest.GetResponse())
    {
        string setCookie = webResponse.Headers["Set-Cookie"];
        if (setCookie != null)
        {
            // Modify CookieValue
            cookievalue = GenerateActualCookieValue(setCookie);
        }
    }

    return cookievalue;
}

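public static string GenerateActualCookieValue(string cookievalue)
{
    // Set-Cookie holds several cookies plus their attributes, split it into segments
    var separators = new char[] { ';', ',' };
    var oldCookieValues = cookievalue.Split(separators);

    // NOTE: indexes 2, 0 and 8 match the response my site returned when I wrote this,
    // check the segments for your own site; the wp-settings-time value is hard-coded too
    string newCookie = oldCookieValues[2] + ";" + oldCookieValues[0] + ";" + oldCookieValues[8] + ";" + "wp-settings-time-2=1345705901";
    return newCookie;
}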

public static List<string> GetUserProfile(string profileUrl, string cookieValue)
{
    var webRequest = (HttpWebRequest)WebRequest.Create(profileUrl);

    webRequest.Method = "GET";
    webRequest.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2) Gecko/20100101 Firefox/10.0.2";
    webRequest.AllowAutoRedirect = false;

    // send the login cookies so that WordPress treats this request as authenticated
    webRequest.Headers.Add("Cookie", cookieValue);

    var htmlDocument = new HtmlDocument();

    // dispose the response and its stream once the HTML has been parsed
    using (var webResponse = (HttpWebResponse)webRequest.GetResponse())
    using (Stream response = webResponse.GetResponseStream())
    {
        htmlDocument.Load(response);
    }

    var responseList = new List<string>();

    // reading all input tags in the page
    var inputs = htmlDocument.DocumentNode.Descendants("input");

    foreach (var input in inputs)
    {
        // keep only the inputs that have both an id and a value
        if (input.Attributes != null && input.Attributes["id"] != null && input.Attributes["value"] != null)
        {
            responseList.Add(input.Attributes["id"].Value + " = " + input.Attributes["value"].Value);
        }
    }

    return responseList;
}
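
GenerateActualCookieValue picks cookies by their position in the Set-Cookie header, which can break as soon as the server changes the order or number of cookies. As a possible alternative, here is a rough sketch that selects cookies by name prefix instead. The class and method names, the "wordpress_" prefix and the sample header are my assumptions for illustration, not something taken from a real response, and the simple split on ';' and ',' assumes cookie values contain neither character.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
public static class CookieSketch
{
    // Builds a "Cookie" request header from a combined "Set-Cookie" response header,
    // keeping only the name=value pairs whose name starts with namePrefix.
    public static string BuildCookieHeader(string setCookieHeader, string namePrefix)
    {
        var pairs = new List<string>();
        foreach (string segment in setCookieHeader.Split(';', ','))
        {
            string part = segment.Trim();
            int eq = part.IndexOf('=');
            if (eq <= 0)
            {
                continue; // flags such as "httponly", or fragments of an expires date
            }

            string name = part.Substring(0, eq);
            // attributes such as path= or expires= fail the prefix check and are skipped
            if (name.StartsWith(namePrefix, StringComparison.OrdinalIgnoreCase) && !pairs.Contains(part))
            {
                pairs.Add(part);
            }
        }
        return string.Join("; ", pairs);
    }

    public static void Main()
    {
        // Made-up header in the shape HttpWebResponse.Headers["Set-Cookie"] returns
        string header = "wordpress_test_cookie=WP+Cookie+check; path=/," +
                        "wordpress_abc=guest%7C1; path=/wp-admin; httponly," +
                        "wordpress_logged_in_abc=guest%7C1; expires=Thu, 01-Jan-2015 00:00:00 GMT; path=/";
        Console.WriteLine(BuildCookieHeader(header, "wordpress_"));
    }
}
```

Another option worth considering is to skip manual parsing altogether and assign the same CookieContainer instance to the CookieContainer property of both HttpWebRequest objects, so that the framework stores and resends the cookies itself.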

4. Summary / Further Reading.

2 Comments

  1. Nikhil Sachdeva June 27, 2013 at 12:04 pm

    Hi,

    I have used this code but I am getting the exception “‘stream.Length’ threw an exception of type ‘System.NotSupportedException’” on the response stream in the GetUserProfile function. Am I sending the wrong cookies?

    Please help me..

    Thanks in advance..
