Christopher M. Church, Scott P. McGinnis
UC Berkeley D-lab
A bookmarklet is a small piece of JavaScript code that runs in your browser when you click a bookmark. We're going to put two bookmarklets into your bookmarks bar in Chrome. Make sure your bookmarks bar is enabled.
Drag the following link into your bookmarks bar: Load jQuery
For more info about this bookmarklet, click here.
Drag the following link into your bookmarks bar: Load Pjscrape
Then, to enable either one on a particular page, just click its link on your bookmarks bar, which should now look like the image to the right.
For more info about this bookmarklet, click here.
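For readers curious about what sits behind those links: a bookmarklet is an ordinary bookmark whose address starts with `javascript:` instead of `http:`. The sketch below shows one way such an address could be built. `makeBookmarklet` is a hypothetical helper written for illustration; it is not the code behind the two links above.

```javascript
// Hypothetical helper: turn a snippet of JavaScript into a bookmarklet address.
function makeBookmarklet(source) {
  // Wrap the snippet in an immediately invoked function so that any variables
  // it declares with var stay out of the page's global scope.
  var wrapped = '(function(){' + source + '})();';
  // Encode characters such as spaces and braces so the address survives as a URL.
  return 'javascript:' + encodeURIComponent(wrapped);
}

var url = makeBookmarklet("alert('Hello from a bookmarklet');");
console.log(url);
```

Pasting the printed address into the URL field of a new bookmark should give a bookmark that shows the alert when clicked.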
If you don't want to use a bookmarklet, you can manually add jQuery with the following console commands.
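var jq = document.createElement('script');
jq.src = "//ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js";
// run noConflict() only once the script has loaded; before that, jQuery does not exist yet
jq.onload = function() { jQuery.noConflict(); };
document.getElementsByTagName('head')[0].appendChild(jq);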
Chrome has a powerful set of tools designed for developers. To access them, you can either right-click an element and inspect it, or open the tools from the Chrome menu.
(Note: ignore the "undefined" that appears after each command. It just means that the console.log function returns no value.)
console.log('Hello, World!');
var my_variable = 'hello world';
console.log(my_variable);
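The note about "undefined" can be checked directly. The console echoes the value of whatever you type, and a call to `console.log` has no value to report. A minimal sketch (the names `returned` and `shouted` are only for illustration):

```javascript
// console.log prints its argument as a side effect...
var returned = console.log('Hello, World!');

// ...but the call itself evaluates to undefined, which is what the console echoes.
console.log(typeof returned); // prints "undefined"

// Typing an expression that does have a value, such as
// my_variable.toUpperCase(), makes the console echo that value instead.
var my_variable = 'hello world';
var shouted = my_variable.toUpperCase();
console.log(shouted); // prints "HELLO WORLD"
```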
jQuery is a JavaScript library that makes it simpler to select and manipulate many elements at one time. Click here for more details.
It follows the basic format:
$('SELECT SOMETHING').DoSomething();
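To make the pattern concrete, here is a rough sketch of what a call such as `$('p').css('color','blue')` amounts to in plain JavaScript: select every matching element, then apply the same change to each one. `setStyleOnAll` is a hypothetical name, and this only approximates the idea; it is not how jQuery itself is implemented.

```javascript
// Hypothetical helper: roughly what $('selector').css(property, value) does.
function setStyleOnAll(doc, selector, property, value) {
  var matches = doc.querySelectorAll(selector);   // "select something"
  for (var i = 0; i < matches.length; i++) {
    matches[i].style[property] = value;           // "do something" to each match
  }
  return matches.length;                          // how many elements were changed
}

// Only runs where a page is available, for example in the browser console.
if (typeof document !== 'undefined') {
  console.log(setStyleOnAll(document, 'p', 'color', 'blue') + ' paragraphs turned blue');
}
```

The jQuery version is shorter because the loop is hidden inside the library.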
Please visit the wiki page on XML. Once on the wiki page, click on your jQuery bookmarklet and then open the JavaScript Console. We are going to enter the jQuery commands below into the JavaScript Console and see what they do. Note: the title bar of the Console will tell you what page you are currently acting on.
$('p');
$('p').css('color','blue');
$('p').text();
console.log($('p').text());
var ptext='';
$('p').each(function() {ptext = ptext +'\n' + $(this).text();});
console.log(ptext);
$('a').each(function(){console.log($(this).attr('href'));});
var txt=''; //declare variable
$('h4').each(function() {txt = txt + $(this).text()+'\n';}); //find each h4 element and add it to variable txt
$('#output').val(txt); //put the txt variable into the output field
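The reason for the `.each()` loops above is that `.text()` on a selection with several elements returns the text of all matches run together with nothing in between, while a loop lets you choose a separator. The sketch below imitates both behaviours with a plain array standing in for the paragraphs (the sample strings are made up), so it runs without jQuery or a page.

```javascript
// Stand-in for the text of three <p> elements (made-up sample data).
var paragraphs = ['First paragraph.', 'Second paragraph.', 'Third paragraph.'];

// Like $('p').text(): everything concatenated with no separator.
var runTogether = paragraphs.join('');

// Like the .each() loop: build the string up one item at a time.
var ptext = '';
paragraphs.forEach(function (text) {
  ptext = ptext + '\n' + text;
});

console.log(runTogether);
console.log(ptext);
```

Note that the loop leaves a newline at the very start of `ptext`; `paragraphs.join('\n')` would give the same lines without it.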
To read local files in Chrome, you need to allow this by starting Chrome with the command line switch "chrome.exe --allow-file-access-from-files". Since we are on a web server, we can only load files from the same domain. This is called the Same Origin Policy.
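Two addresses share an origin only when their scheme, host, and port all match. A minimal sketch of that comparison using the standard `URL` class follows; `sameOrigin` is a hypothetical helper, and the example.com addresses are placeholders rather than the workshop's real server.

```javascript
// Hypothetical helper: true when both addresses have the same scheme, host, and port.
function sameOrigin(a, b) {
  return new URL(a).origin === new URL(b).origin;
}

var page = 'http://example.com/workshop/index.html';

console.log(sameOrigin(page, 'http://example.com/xml/books.xml'));      // true: only the path differs
console.log(sameOrigin(page, 'https://example.com/xml/books.xml'));     // false: different scheme
console.log(sameOrigin(page, 'http://example.com:8080/xml/books.xml')); // false: different port
console.log(sameOrigin(page, 'http://other.example.org/books.xml'));    // false: different host
```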
First, take a look at "books.xml" to see what it contains.
We are now going to load the XML file.
$.ajax({
    type: "GET",
    url: "xml/books.xml",
    dataType: "xml",
    success: function(xml) {
        $('#output').val($('Book',xml).text());
    }
});
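If the request fails (a wrong path, or a file on another domain), the `success` callback never runs and the output field simply stays empty. The hedged sketch below adds an `error` callback so the failure becomes visible. `loadBooks` is a hypothetical wrapper that takes the jQuery object as its first argument, and `#output` is the same field used above.

```javascript
// Hypothetical wrapper around the request above, with a failure handler added.
function loadBooks(jq, url) {
  jq.ajax({
    type: 'GET',
    url: url,
    dataType: 'xml',
    success: function (xml) {
      jq('#output').val(jq('Book', xml).text());
    },
    error: function (jqXHR, textStatus, errorThrown) {
      // textStatus is a short string such as "error", "timeout" or "parsererror"
      jq('#output').val('Could not load ' + url + ': ' + textStatus);
    }
  });
}

// Only runs where jQuery has been loaded, for example after clicking the bookmarklet.
if (typeof jQuery !== 'undefined') {
  loadBooks(jQuery, 'xml/books.xml');
}
```

Passing jQuery in as a parameter is a design choice made only so the function can be tried with a stand-in object; calling `$.ajax` directly works just as well in the console.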
var xmlDoc;
Load XML file and store it in variable xmlDoc
$.ajax({
    type: "GET",
    url: "xml/books.xml",
    dataType: "xml",
    success: function(xml) {xmlDoc=xml;}
});
Send all books in the XML doc to output field
$('#output').val($(xmlDoc).find("Book").text());
$('#output').val($('Title',xmlDoc).text());
$('Book',xmlDoc).attr('subject','programming');
$("Book:contains('SQL')",xmlDoc).attr('subject','MySQL');
var xmlString = (new XMLSerializer()).serializeToString(xmlDoc); //convert the XML document back into a string
$('#output').val(xmlString);
Now that you've seen how selectors work, you are ready to do some real web scraping.
Download PhantomJS at http://phantomjs.org/, which allows you to run JavaScript from your command prompt or terminal.
Download pjscrape from http://nrabinowitz.github.com/pjscrape/, which allows you to scrape data from websites.
You can also download pjscrape and PhantomJS already set up from the Dropbox folder.
This work and all pages herein are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.