The challenge I faced: I wanted to catalog a blog site, format the content, and dump it into a SQL database to help out a buddy.
Solution Misstep 1:
I began to go down the road of using .NET's WebRequest and WebResponse objects.
You can find a lot about this approach on Google, and the examples usually run like this:
// needs System, System.IO, System.Net, System.Text

// used to build entire input
StringBuilder sb = new StringBuilder();

// used on each read operation
byte[] buf = new byte[8192];

// marker that starts the content we want -- placeholder value, depends on the blog's markup
string beginstr = "<div class=\"post\">";
int begin = -1;

// prepare the web page we will be asking for
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://myblog.com/blog/");

// execute the request
HttpWebResponse response = (HttpWebResponse)request.GetResponse();

// we will read data via the response stream
Stream resStream = response.GetResponseStream();

string tempString = null;
int count = 0;

FileStream fs = new FileStream("test.txt", FileMode.Create);
StreamWriter writer = new StreamWriter(fs);

do
{
    // fill the buffer with data
    count = resStream.Read(buf, 0, buf.Length);

    // make sure we read some data
    if (count != 0)
    {
        // translate from bytes to ASCII text
        tempString = Encoding.ASCII.GetString(buf, 0, count);

        // locate and isolate the WOD
        begin = tempString.IndexOf(beginstr);
        if (begin > -1)
        {
            Console.WriteLine(tempString);
        }
    }
} while (count > 0); // any more data to read?

writer.Close();
Console.WriteLine("Press enter and all that.");
Console.ReadLine();
Wow, that feels like a whole lot of code for something pretty simple.
Resolution:
A buddy stumbled across this in a post. It does everything the code above does in essentially one line, and just for kicks it also strips the HTML out for you. Nice.
public string RemoveHtml(string sURL)
{
    try
    {
        // download the page in one shot and strip anything that looks like a tag
        using (System.Net.WebClient wc = new System.Net.WebClient())
            return System.Text.RegularExpressions.Regex.Replace(
                new System.IO.StreamReader(wc.OpenRead(sURL)).ReadToEnd(),
                "<[^>]*>", "");
    }
    catch (Exception ex)
    {
        // MessageBox needs a reference to System.Windows.Forms
        MessageBox.Show(ex.Message);
        return null;
    }
}
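Calling it is about as simple as it gets. A minimal usage sketch, assuming RemoveHtml lives in whatever class you drop it into; the output file name is just something I made up:

// hypothetical usage -- dump the stripped text of one page to a file
string text = RemoveHtml("http://myblog.com/blog/");
if (text != null)
    System.IO.File.WriteAllText("blogdump.txt", text);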
I'll take it. If you look at the two sets of code, you'll notice the second snippet just creates a WebClient object and grabs the whole HTML dump with OpenRead. The first creates a WebRequest, fires it off, gets the WebResponse, and then reads the response into a byte buffer chunk by chunk in a do-while loop. I'll let you decide which code you'd rather use. Needless to say, the smaller code footprint was also faster.
Problem:
Alas, none of this was getting me any closer to crawling a specific set of URLs. Eventually I realized what I was really looking for was a sitemap generator, and where better to look for advice than Google. They've compiled a very nice page of sitemap generators (http://code.google.com/p/sitemap-generators/wiki/SitemapGenerators), and although I was looking forward to building my own in C#, the first rule of programming is laziness, so I quickly found a free tool that got me closer to what I needed for this task:
http://wonderwebware.com/sitemap-generator/download.html
Slick tool.
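If I had built my own, the core of it isn't much code. Here's a rough sketch of what I had in mind, assuming you only want same-host pages and don't care about robots.txt, redirects, or crawl delays (all things the free tool handles for you); the start URL and the 500-page cap are placeholders:

// Very rough sitemap-generator sketch -- same-host pages only, no politeness,
// no retry logic. A real generator does all of that for you.
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

class SitemapSketch
{
    static void Main()
    {
        var start = new Uri("http://myblog.com/blog/");   // placeholder start URL
        var seen = new HashSet<string>();
        var queue = new Queue<Uri>();
        queue.Enqueue(start);

        using (var wc = new WebClient())
        {
            while (queue.Count > 0 && seen.Count < 500)    // arbitrary safety cap
            {
                Uri page = queue.Dequeue();
                if (!seen.Add(page.AbsoluteUri)) continue; // already visited

                string html;
                try { html = wc.DownloadString(page); }
                catch (WebException) { continue; }         // skip pages that fail

                // pull href values out of anchor tags and stay on the same host
                foreach (Match m in Regex.Matches(html, "href=\"([^\"]+)\"", RegexOptions.IgnoreCase))
                {
                    Uri link;
                    if (Uri.TryCreate(page, m.Groups[1].Value, out link) && link.Host == start.Host)
                        queue.Enqueue(link);
                }
            }
        }

        foreach (string url in seen)
            Console.WriteLine(url);                        // the "sitemap"
    }
}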
That got me an HTML dump of all the blog pages. A little Notepad++ and Excel sorting got me what I wanted (date/day/content); a rough code version of that slicing follows the links below. Here are a couple of links I found along the way:
- Notepad++: how to use regular expressions: http://markantoniou.blogspot.com/2008/06/notepad-how-to-use-regular-expressions.html
- Regular expressions for HTML: http://www.rubular.com/r/I8YzL4yY1a
- Notepad++: sorting and removing duplicate lines: http://stackoverflow.com/questions/3958350/notepad-removing-duplicate-rows
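For the curious, that date/content slicing could also be done in code rather than by hand. A hedged sketch, assuming the blog wraps dates and post bodies in markup roughly like the pattern below; the class names ("date", "entry") and the input file name are made up for illustration, not the real markup:

// Hypothetical extraction sketch -- the regex assumes markup like
// <h2 class="date">...</h2> followed by <div class="entry">...</div>,
// which is just an illustration, not the actual blog's HTML.
using System;
using System.IO;
using System.Text.RegularExpressions;

class ExtractPosts
{
    static void Main()
    {
        string html = File.ReadAllText("blogdump.html");   // output of the scrape step

        var pattern = new Regex(
            "<h2 class=\"date\">(?<date>.*?)</h2>\\s*<div class=\"entry\">(?<body>.*?)</div>",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);

        foreach (Match m in pattern.Matches(html))
        {
            string date = m.Groups["date"].Value.Trim();
            // strip any leftover tags from the body, same trick as RemoveHtml above
            string body = Regex.Replace(m.Groups["body"].Value, "<[^>]*>", "").Trim();
            Console.WriteLine(date + "\t" + body);          // tab-separated, ready for Excel
        }
    }
}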
Then I dumped the Excel rows into a SQL database table. Done.
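For completeness, the load itself is just parameterized inserts. A minimal sketch, assuming a table like BlogPosts(PostDate, Content) and a local connection string; both are placeholders, not my actual schema:

// Hypothetical SQL load sketch -- table name, columns, and connection string
// are placeholders for whatever schema you actually create.
using System;
using System.Data.SqlClient;

class LoadPosts
{
    static void Main()
    {
        string connStr = "Server=.;Database=BlogArchive;Trusted_Connection=True;";
        using (var conn = new SqlConnection(connStr))
        {
            conn.Open();
            using (var cmd = new SqlCommand(
                "INSERT INTO BlogPosts (PostDate, Content) VALUES (@date, @content)", conn))
            {
                cmd.Parameters.AddWithValue("@date", new DateTime(2011, 1, 1)); // example row
                cmd.Parameters.AddWithValue("@content", "post text goes here");
                cmd.ExecuteNonQuery();
            }
        }
    }
}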
Conclusion? This was a case where I was reinventing the wheel. Standing on the shoulders of giants saved me a lot of thrashing.
Thanks,
Mike