The challenge I faced: I wanted to catalog a blog site, format the content, and dump it into a SQL database to help out a buddy.
Solution misstep 1:
I began to go down the road of using .NET's WebRequest and WebResponse objects.
You'll find a lot of examples of this on Google, and they run like this:
using System;
using System.IO;
using System.Net;
using System.Text;

// used to build the entire input
StringBuilder sb = new StringBuilder();
// used on each read operation
byte[] buf = new byte[8192];

// prepare the web page we will be asking for
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://myblog.com/blog/");
// execute the request
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
// we will read data via the response stream
Stream resStream = response.GetResponseStream();

string tempString = null;
int count = 0;
// marker used to locate the WOD in the page (site-specific)
string beginstr = "...";
int begin = -1;

FileStream fs = new FileStream("test.txt", FileMode.Create);
StreamWriter writer = new StreamWriter(fs);
do
{
    // fill the buffer with data
    count = resStream.Read(buf, 0, buf.Length);
    // make sure we read some data
    if (count != 0)
    {
        // translate from bytes to ASCII text
        // (note: a marker split across two reads will be missed)
        tempString = Encoding.ASCII.GetString(buf, 0, count);
        sb.Append(tempString);
        // locate and isolate the WOD
        begin = tempString.IndexOf(beginstr);
        if (begin > -1)
        {
            Console.WriteLine(tempString);
            writer.WriteLine(tempString);
        }
    }
}
while (count > 0); // any more data to read?
writer.Close();
Console.WriteLine("Press enter and all that.");
Console.ReadLine();
Wow, that feels like a whole lot of code to do something pretty simple, doesn't it?
Resolution:
A buddy stumbled across this in a post. It does everything the code above does in one line. Just for kicks, it also strips the HTML out for you. Nice.
public string RemoveHtml(string sURL)
{
    try
    {
        using (System.Net.WebClient wc = new System.Net.WebClient())
        using (System.IO.StreamReader reader = new System.IO.StreamReader(wc.OpenRead(sURL)))
        {
            // crude tag-stripping; fine for a quick text dump, not for real HTML parsing
            return System.Text.RegularExpressions.Regex.Replace(reader.ReadToEnd(), "<[^>]*>", "");
        }
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message); // assumes a WinForms context
        return null;
    }
}
I'll take it. If you look at the two sets of code, you'll notice the second snippet creates a WebClient object and just grabs the HTML dump with OpenRead. The first snippet creates a WebRequest, fires it off, and gets the WebResponse; then it fills a byte buffer and runs a do-while loop to process each chunk. Wow! I'll let you decide which code you'd rather use. Needless to say, the smaller code footprint was faster.
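For what it's worth, WebClient can shorten the fetch even further: DownloadString pulls the whole page body as a string in one call, handling the stream and decoding for you. A minimal sketch (the URL is the same placeholder as above):

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

class Fetcher
{
    static void Main()
    {
        using (WebClient wc = new WebClient())
        {
            // one call replaces the request/response/stream/buffer dance
            string html = wc.DownloadString("http://myblog.com/blog/");
            // same crude tag-stripping as the one-liner above
            string text = Regex.Replace(html, "<[^>]*>", "");
            Console.WriteLine(text);
        }
    }
}
```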
Problem:
Alas, none of this was getting me any closer to being able to crawl through a specific set of URLs. Eventually I realized I was looking for a sitemap generator, and where better to look for advice than Google. They've compiled a very nice page of sitemap generators (http://code.google.com/p/sitemap-generators/wiki/SitemapGenerators), and although I was looking forward to building my own in C#, the first rule of programming is laziness, so I quickly found a free tool that got me closer to what I needed for this task:
http://wonderwebware.com/sitemap-generator/download.html
Slick tool.
That got me an HTML dump of all the blog pages. A little Notepad++ and Excel sorting got me what I wanted (date/day/content). Here are a couple of links I found along the way:
- Notepad++: how to use regular expressions: http://markantoniou.blogspot.com/2008/06/notepad-how-to-use-regular-expressions.html
- Regular expressions for HTML: http://www.rubular.com/r/I8YzL4yY1a
- Notepad++: sorting & removing duplicate lines: http://stackoverflow.com/questions/3958350/notepad-removing-duplicate-rows
Then I dumped the Excel sheet into a SQL database table. Done.
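If you'd rather script that last load instead of going through Excel's import wizard, a parameterized insert does the trick. A sketch, assuming a SQL Server table with the date/day/content columns above (the connection string, table name, and column names here are all made up for illustration):

```csharp
using System;
using System.Data.SqlClient;

class Loader
{
    static void Main()
    {
        // hypothetical connection string and schema
        string connStr = "Server=.;Database=Blog;Integrated Security=true";
        using (SqlConnection conn = new SqlConnection(connStr))
        using (SqlCommand cmd = new SqlCommand(
            "INSERT INTO BlogPosts (PostDate, DayName, Content) VALUES (@date, @day, @content)",
            conn))
        {
            conn.Open();
            // parameters avoid quoting headaches and SQL injection
            cmd.Parameters.AddWithValue("@date", new DateTime(2010, 1, 4));
            cmd.Parameters.AddWithValue("@day", "Monday");
            cmd.Parameters.AddWithValue("@content", "Sample post content");
            cmd.ExecuteNonQuery();
        }
    }
}
```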
Conclusion? This was a case where I was reinventing the wheel. Standing on the shoulders of giants saved me a lot of thrashing.
Thanks,
Mike