Icedawn
Aug 10th, 2007, 12:22 AM
I stupidly suggested that a solution could be easily found for a problem we faced at work... and now I realize I don't quite know how to solve it, although I know in principle it's doable.
Problem:
The Federal Court website offers a way to search by file number, and file numbers are handed out sequentially. However, you can't search with a wildcard, so if you wanted to get file numbers T-0001-07 -> T-9999-07 (the last two digits are the year), you'd have to submit a search about 10000 times.
I'm SURE there's a way to automate the form submission... any suggestions on how?
Here's the site:
http://cas-ncr-nter03.cas-satj.gc.ca/IndexingQueries/infp_queries_e.php?stype=court&select_court=T
For example, the input T-2-07 works.
chatbox
Aug 10th, 2007, 09:03 AM
Alright, you'll need three things:
1. A .BAT file (I'll include this for you here).
2. WinHTTrack (software to download html when given a link...in this case, a bunch of links from the above .BAT output).
3. Google Desktop Search (this will give you the ability to search your local/downloaded copies of the HTML "COURT INDEX AND DOCKET" from WinHTTrack)
I'm assuming you're using Windows.
For the .BAT:
1. Start -> Run -> input "Notepad".
2. Paste the following two lines into Notepad and save the file in C:\Temp as links.bat.
@echo off
for /l %%a in (1,1,10000) do echo http://cas-ncr-nter03.cas-satj.gc.ca/IndexingQueries/infp_moreInfo_e.php?T-%%a-07 >> links.txt
3. Run links.bat. It'll create a text file named links.txt containing 10000 links, one per line.
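If you're not on Windows, or would rather not use a .BAT, the same list can be built in Python. This is only a sketch: the URL pattern is copied from the batch file above, while the function names and parameters (build_links, write_links, first, last, year, court) are made up for this example.

```python
# Sketch: build the same list of docket URLs that links.bat writes.
# BASE_URL is copied from the batch file; everything else is hypothetical naming.
BASE_URL = "http://cas-ncr-nter03.cas-satj.gc.ca/IndexingQueries/infp_moreInfo_e.php"


def build_links(first=1, last=10000, year="07", court="T"):
    """Return one URL per file number from first to last, inclusive."""
    return [f"{BASE_URL}?{court}-{n}-{year}" for n in range(first, last + 1)]


def write_links(path="links.txt", **kwargs):
    """Write the URLs to a text file, one per line, replacing any old file."""
    links = build_links(**kwargs)
    with open(path, "w", encoding="ascii") as handle:
        handle.write("\n".join(links) + "\n")
    return len(links)
```

Calling write_links() with no arguments should give the same 10000 lines as the batch loop. Unlike the `>>` in the .BAT, it overwrites links.txt, so running it twice doesn't duplicate the list.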
For WinHTTrack:
1. Download it from Download.com or something, can't remember where exactly.
2. Install it.
3. Start WinHTTrack and create a new project.
4. In the new project, define the following:
Project name: Court Indexes
Click Next.
URL list (.txt): point it to the links.txt file created above.
Click "Set options".
Under the "Limits" tab, set "Maximum mirroring depth" to 0.
Under the "Browser ID" tab, set "Browser Identity" to "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)" (first on the list).
Also set HTML footer to "none".
Click ok.
Click Next and Click Finish.
It'll now start downloading the 10000 HTML documents to your local hard disk, which might take some time.
After a... long while, the files will be sitting in your C:\My Web Sites\<project name>\cas-ncr-nter03.cas-satj.gc.ca directory.
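If you'd rather script the download than use WinHTTrack, something like the following might do. It's a rough sketch, not a tested replacement: the helper names (fetch_page, download_all), the one-second delay and the T-2-07.html file naming are all my own assumptions, and the User-Agent is just the Browser ID string from the steps above.

```python
import time
import urllib.request
from pathlib import Path

# Same string as the "Browser Identity" setting in the WinHTTrack steps.
USER_AGENT = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"


def fetch_page(url, timeout=30):
    """Download one URL and return the raw bytes."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def download_all(links, out_dir, fetch=fetch_page, delay=1.0):
    """Save each page as <file number>.html and return the URLs that failed."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failed = []
    for url in links:
        name = url.rsplit("?", 1)[-1] + ".html"  # e.g. T-2-07.html
        try:
            (out / name).write_bytes(fetch(url))
        except OSError:  # URLError and socket timeouts are OSError subclasses
            failed.append(url)
        time.sleep(delay)  # pause between requests to go easy on the server
    return failed
```

You'd feed it the lines of links.txt and an output folder. The returned list tells you which file numbers to retry. At one request per second, 10000 pages take close to three hours, so it's no faster than WinHTTrack, just easier to control.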
For Google Desktop Search:
1. Download it from Google.
2. Install it.
3. Set it to index the above directory of your project.
It will take some time for it to index the HTML files.
....and then...
Way hey, you're done.
Now you can just tap the Ctrl key twice, and Google Desktop Search will pop up the search strip in the middle of your screen. You can then search using whatever words you normally use, and it will show the matching HTML files.
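If installing a desktop indexer feels like overkill, a plain brute-force search over the downloaded folder also works for a few thousand small files. A minimal sketch, assuming the pages were saved with an .html extension (search_pages is a made-up name):

```python
from pathlib import Path


def search_pages(folder, phrase):
    """Return sorted names of .html files under folder that contain phrase.

    The match is case-insensitive and looks at the raw HTML, tags included.
    """
    needle = phrase.lower()
    hits = []
    for page in Path(folder).rglob("*.html"):
        # errors="replace" keeps odd bytes from older encodings from raising.
        text = page.read_text(encoding="utf-8", errors="replace")
        if needle in text.lower():
            hits.append(page.name)
    return sorted(hits)
```

It rereads every file on each search, so it will be slower than an index, but there's nothing to install and nothing to wait for.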
Enjoy and have fun.
Icedawn
Aug 11th, 2007, 10:21 AM
aww.. maan... you rock.
I'll throw together a quick parser to pull out the relevant information and then I'll be good to go.
thanks again.
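For anyone following along, a quick parser of that sort could start from Python's standard html.parser. This is only a sketch under a big assumption: that the docket pages put their data in ordinary table rows, which I haven't checked. The names CellText and table_rows are made up, and nested tables aren't handled.

```python
from html.parser import HTMLParser


class CellText(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def _end_cell(self):
        if self._cell is not None and self._row is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def _end_row(self):
        self._end_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._end_row()  # tolerate a missing </tr> on the previous row
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._end_cell()  # tolerate a missing </td> on the previous cell
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._end_cell()
        elif tag == "tr":
            self._end_row()


def table_rows(html):
    """Return the table rows found in an HTML string as lists of cell text."""
    parser = CellText()
    parser.feed(html)
    parser.close()
    parser._end_row()  # flush a row left open at the end of the input
    return parser.rows
```

From there it's a matter of looking at one real page to see which rows and columns hold the fields you care about.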