Appengine wo blobs #2

Open · wants to merge 72 commits into master
Changes from all commits
42ebc29
[Tools][createdb]
Sep 17, 2010
0512d50
[ItemPipeline]
Sep 17, 2010
cc25834
* added utils and settings directories to root path of the project
Sep 19, 2010
1606596
[web2py]
Sep 19, 2010
1f6b979
initial commit of *not finished* firefox add-on
sardok Oct 14, 2010
7e8b6e6
Replaced hardcoded 'bookcrawler' with 'kitapsever'.
Nov 10, 2010
dde1161
Set default database table from bookcrawler to kitapsever
Nov 10, 2010
01a23e6
database settings file update
Nov 10, 2010
deb2200
initial commit of Makefile for firefox-extension
Dec 3, 2010
4c40002
[firefox-addon] Removed KitapSever.xpi, user needs to execute make com…
Dec 9, 2010
41b3dc0
[firefox-addon] make forces to clean available .xpi file.
Dec 13, 2010
bdb525a
[firefox-addon] Updated project home page url.
Dec 13, 2010
c3818fe
[firefox-addon]
Dec 13, 2010
600cf6d
[firefox-extension]
Dec 16, 2010
56ad230
[firefox-addon] Better formatted popup message.
Dec 21, 2010
420633f
[firefox addon] Added icon.
Feb 8, 2011
e4a1bcc
[firefox-addon] updated icon.
Feb 9, 2011
7d93675
[firefox-addon] Disabled autorun for now.
Feb 9, 2011
72760a8
Data is written to a file instead of being stored in the database.
Feb 14, 2011
5314f97
Fix price extraction error in ideefix.com spider
Feb 14, 2011
d0ede94
Remove redundant line
Feb 14, 2011
24b8bc8
Initial checkin for kitapsever Google App Engine application.
Feb 14, 2011
5c18456
Add a new handler to query books by isbn.
Feb 14, 2011
fcb640d
Make crawler get book service address from settings
Feb 14, 2011
a0f7873
Make chrome-extension use new Google AppEngine service as book service.
Feb 14, 2011
6c6a6ce
Added new 128 pixel icon
Feb 15, 2011
cc3e101
Fixed ilknokta.com spider
Feb 15, 2011
03e1737
Fix for pandora.com.tr spider
Feb 15, 2011
6ddb94d
Fixed Pandora.com.tr spider
Feb 16, 2011
7b4a92d
Extracted a new AppEngineExportPipeline from FileExportPipeline
Feb 16, 2011
050df04
Replaced urllib2 with urllib.
Feb 16, 2011
0f9c540
Make Pandora.com.tr spider deny beyoglu.pandora.com.tr links which ap…
Feb 16, 2011
d535d0e
Disabled FileExportPipeline
Feb 16, 2011
58b6d10
Parse ISBNs that end with X as check digit
Feb 16, 2011
2de6712
Replaced db.GqlQuery calls with Model.gql calls
Feb 16, 2011
c8e866c
Added AllBooksByISBNQueryHandler to list all the book entries with a …
Feb 16, 2011
01fe533
Changed Clean handler such that it deletes 5000 book entries at once
Feb 16, 2011
026fb25
Removed createdb.py and readdb.py helper scripts.
Feb 16, 2011
f7d21a9
Make chrome-extension parse ISBNs that end with X
Feb 17, 2011
8d344f6
Set the version to 0.1.1
Feb 17, 2011
0619b09
Updated README with branch info.
sardok Feb 18, 2011
10ffaf5
initial commit of *not finished* firefox add-on
sardok Oct 14, 2010
bb45e4e
initial commit of Makefile for firefox-extension
Dec 3, 2010
caf8191
[firefox-addon] Removed KitapSever.xpi, user needs to execute make com…
Dec 9, 2010
933b123
[firefox-addon] make forces to clean available .xpi file.
Dec 13, 2010
e802107
[firefox-addon] Updated project home page url.
Dec 13, 2010
5cd61b2
[firefox-addon]
Dec 13, 2010
d9f7d27
[firefox-extension]
Dec 16, 2010
697549a
[firefox-addon] Better formatted popup message.
Dec 21, 2010
ad3918a
[firefox addon] Added icon.
Feb 8, 2011
51b715e
[firefox-addon] updated icon.
Feb 9, 2011
1cfb340
[firefox-addon] Disabled autorun for now.
Feb 9, 2011
34875cc
Some cleaning plus made code resemble more to chrome-extension codes …
Feb 18, 2011
d19411d
[appengine] BookQueryHandler does not need to return results back.
Feb 19, 2011
201bf03
[firefox-addon] Enabled LOAD_BYPASS_CACHE on connection between serve…
sardok Feb 21, 2011
c0eb2c0
[chrome-extension] Moved ISBN parser logic from the end to the beginn…
Feb 20, 2011
67eed3e
Added DropItem exception raising logic
Feb 21, 2011
a11c70c
Price's input_processor logic uses TakeLast logic as opposed to TakeF…
Feb 21, 2011
b74d62d
Fix for netkitap.com spider
Feb 21, 2011
bb56ccd
Merge remote branch 'sardok/appengine_wo_blobs'
Mar 23, 2011
40b85ae
Merge remote branch 'rimbi/appengine_wo_blobs'
Mar 23, 2011
7eb3c79
[crawler] Added run_all.sh script to run all the spiders respectively.
sardok Mar 23, 2011
6ac6a68
[crawler] Added useful options to the crawler's run script.
Mar 23, 2011
660c06d
[crawler] removed deprecated run_all script
Mar 23, 2011
60c93ca
[firefox-addon] Removed meaningless items.
Mar 24, 2011
b0825ad
[firefox-addon]
Mar 24, 2011
f120e27
[firefox-addon] Only one notification box instance should be displayed.
Mar 24, 2011
10f0a20
[firefox-addon]
Mar 24, 2011
b7ad6c4
[firefox-addon] C style comments.
Mar 25, 2011
3c40141
[firefox-addon]
Mar 25, 2011
ef85afa
[Spider: kitapyurdu] Fix price parser.
Apr 11, 2011
8a26b18
[Spider: pandora] Fix price parse.
Jun 27, 2011
1 change: 1 addition & 0 deletions README
@@ -0,0 +1 @@
Bookcrawler branch which works on google appengine.
7 changes: 4 additions & 3 deletions chrome-extension/contentscript.js
@@ -1,4 +1,8 @@

var re = /ISBN[:\s]*([X0-9\\-]*)/g;
rawISBN = document.body.innerText.match(re)[0];
isbn = rawISBN.replace(/-/g, "").slice(-10, -1)

function createRootElement(id) {
root = document.createElement("div");
root.id = id;
@@ -46,9 +50,6 @@ function showBooks(responseText) {
}
}

var re = /ISBN[:\s]*([0-9\\-]*)/g;
rawISBN = document.body.innerText.match(re)[0];
isbn = rawISBN.replace(/-/g, "").slice(-10)
//alert(isbn);
chrome.extension.sendRequest({'action' : 'fetchBooks', 'selectedText' : isbn}, showBooks);
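The hunk above moves the ISBN extraction to the top of the content script and widens the regex to accept an X check digit. A minimal Python sketch of the same extraction idea (the sample text and helper name are hypothetical, not part of the PR):

```python
import re

# Mirrors the widened pattern: digits, hyphens, and an X check digit
# are all accepted after the "ISBN" label.
ISBN_RE = re.compile(r'ISBN[:\s]*([X0-9\-]*)')

def extract_isbn(text):
    """Return the first ISBN-like token with hyphens stripped, or None."""
    match = ISBN_RE.search(text)
    if match is None:
        return None
    return match.group(1).replace('-', '')

print(extract_isbn('ISBN: 975-8717-07-X'))  # 975871707X
```

The content script additionally trims the result with `slice(-10, -1)`; the sketch stops at hyphen stripping.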

Binary file removed chrome-extension/icon.png
Binary file added chrome-extension/icon_128.png
2 changes: 1 addition & 1 deletion chrome-extension/index.html
@@ -4,7 +4,7 @@
<body>
<script type="text/javascript">
function sendServiceRequest(selectedText, callback) {
var serviceCall = "http://127.0.0.1:8000/myapp/default/query.xml?column_name=isbn&query_string=" + selectedText;
var serviceCall = "http://rimbiskitapsever.appspot.com/bookbyisbn?isbn=" + selectedText;
var req = new XMLHttpRequest();
req.open("GET", serviceCall, true);
req.onload = showBooks;
8 changes: 2 additions & 6 deletions chrome-extension/manifest.json
@@ -1,14 +1,10 @@
{
"name": "Kitapsever",
"version": "1.0",
"version": "0.1.1",
"description": "Kitapsever için en uygun kitabı bulur.",
// "browser_action": {
// "default_icon": "icon.png"
// },

"icons": {
"48" : "icon.png",
"128" : "icon.png"
"128" : "icon_128.png"
},
"background_page" : "index.html",
"permissions": [
3 changes: 2 additions & 1 deletion crawler/crawler/items.py
@@ -5,7 +5,7 @@
# http://doc.scrapy.org/topics/items.html

from scrapy.item import Item, Field
from scrapy.contrib.loader.processor import Join, TakeFirst
from scrapy.contrib.loader.processor import Join, TakeFirst, Compose

class BookItem(Item):
# define the fields for your item here like:
@@ -32,6 +32,7 @@ class BookItem(Item):
price = Field(
default = u'0 TL',
output_processor = TakeFirst(),
input_processor = Compose(lambda v: v[-1:]),
)
store = Field(
default = 0,
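Per the commit message, the new `input_processor` gives the price field take-last rather than take-first semantics: `Compose(lambda v: v[-1:])` trims the collected values to the final one before `TakeFirst()` runs. A standalone sketch of that behaviour (the sample prices are hypothetical):

```python
# The lambda passed to Compose: keep only the last collected value.
take_last = lambda values: values[-1:]

# When several prices are scraped from one page, only the final one
# survives, so the TakeFirst() output processor then returns it.
collected = ['34,00 TL', '25,50 TL']
print(take_last(collected))  # ['25,50 TL']
```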
102 changes: 45 additions & 57 deletions crawler/crawler/pipelines.py
@@ -6,74 +6,62 @@
from scrapy.xlib.pydispatch import dispatcher
from scrapy.core import signals
from string import replace
from sqlalchemy import create_engine
from sqlalchemy import Table, Column, Integer, Float, Unicode, MetaData, and_
from sqlalchemy.orm import mapper, sessionmaker
from crawler.settings import BOOK_SERVICE_ADDRESS
import urllib
from scrapy.core.exceptions import DropItem

class Book(object):
def __init__(self, name, isbn, author, publisher, link, price, store):
self.name = name
self.isbn = isbn
self.author = author
self.publisher = publisher
self.link = link
self.price = price
self.store = store
ITEM_SEPERATOR = ";"

def __repr__(self):
return u"<Book('%s', '%s', '%s', '%s', '%s', '%f' '%d')>" % (self.name, self.isbn, self.author, self.publisher, self.link, self.price, self.store)

metadata = MetaData()

books_table = Table('books', metadata,
Column('id', Integer, primary_key=True),
Column('isbn', Unicode(255)),
Column('name', Unicode(255)),
Column('author', Unicode(255)),
Column('publisher', Unicode(255)),
Column('link', Unicode(255)),
Column('price', Float(precision=2)),
Column('store', Integer))

mapper(Book, books_table)
class AppEngineExportPipeline(object):
def process_item(self, spider, item):
try:
link = item['link'].strip()
isbn = item['isbn'].strip().replace("-", "")
if len(isbn) >= 10:
isbn = isbn[-10:-1]
price = replace(item['price'], ',', '.')
store = str(item['store'])
line = isbn + ITEM_SEPERATOR
line = line + link + ITEM_SEPERATOR
line = line + price + ITEM_SEPERATOR
line = line + store + "\n"
params = urllib.urlencode({'isbn': isbn, 'price': price, 'store': store, 'link': link})
f = urllib.urlopen(BOOK_SERVICE_ADDRESS + '?%s' % params)
f.close()
except AttributeError:
print "Attribute error in parsing item at %s" % link
raise DropItem()

class DbExportPipeline(object):
i = 0
return item

class FileExportPipeline(object):
def __init__(self):
dispatcher.connect(self.spider_opened, signals.spider_opened)
dispatcher.connect(self.spider_closed, signals.spider_closed)
self.session = None
self.out_file = None

def spider_opened(self, spider):
self.session = sessionmaker(bind=create_engine('mysql://root:123456@localhost/bookcrawler', echo=True))()
DbExportPipeline.i += 1
self.out_file = open(spider.domain_name + ".txt", "w")

def spider_closed(self, spider):
DbExportPipeline.i -= 1
if DbExportPipeline.i == 0:
self.session.close()
self.out_file.close()

def process_item(self, spider, item):
book_isbn = item['isbn'].strip().replace("-", "")
if len(book_isbn) == 13:
book_isbn = book_isbn[-10:]
book_name = unicode(item['name'].strip())
book_author = unicode(item['author'].strip())
book_publisher = unicode(item['publisher'].strip())
book_link = unicode(item['link'].strip())
book_price = float(replace(item['price'], ',', '.'))
book_store = item['store']
book = self.session.query(Book).filter(and_(Book.isbn == book_isbn, Book.store == book_store)).first()
if book is None:
book = Book(book_name, book_isbn, book_author, book_publisher, book_link, book_price, book_store)
self.session.add(book)
else:
book.price = book_price
book.name = book_name
book.author = book_author
book.publisher = book_publisher
book.link = book_link
self.session.flush()
self.session.commit()
try:
link = item['link'].strip()
isbn = item['isbn'].strip().replace("-", "")
if len(isbn) >= 10:
isbn = isbn[-10: -1]
price = replace(item['price'], ',', '.')
store = str(item['store'])
line = isbn + ITEM_SEPERATOR
line = line + link + ITEM_SEPERATOR
line = line + price + ITEM_SEPERATOR
line = line + store + "\n"
self.out_file.write(line)
except AttributeError:
print "Attribute error in parsing item at %s" % link
raise DropItem()

return item
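The `AppEngineExportPipeline` above URL-encodes the book fields and issues a GET against `BOOK_SERVICE_ADDRESS`. A Python 3 sketch of the URL it composes (the field values are hypothetical, and the original uses Python 2's `urllib.urlencode`/`urllib.urlopen`):

```python
from urllib.parse import urlencode

# The commented-out development setting from settings.py.
BOOK_SERVICE_ADDRESS = 'http://localhost:8080/book'

# Same four fields the pipeline sends per scraped item.
params = urlencode({'isbn': '975871707X', 'price': '9.90',
                    'store': '2', 'link': 'http://example.com/book'})
url = '%s?%s' % (BOOK_SERVICE_ADDRESS, params)
print(url)
```

The pipeline then opens this URL (`urllib.urlopen(url)`) and immediately closes the response; the App Engine handler does the actual datastore write.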

9 changes: 7 additions & 2 deletions crawler/crawler/settings.py
@@ -18,6 +18,11 @@
DEFAULT_ITEM_CLASS = 'crawler.items.BookItem'
#USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)
USER_AGENT = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13'
ITEM_PIPELINES = ['crawler.pipelines.DbExportPipeline']
ITEM_PIPELINES = [
# 'crawler.pipelines.FileExportPipeline',
'crawler.pipelines.AppEngineExportPipeline'
]
CONCURRENT_REQUESTS_PER_SPIDER = 1
DOWNLOAD_DELAY = 2
DOWNLOAD_DELAY = 1
BOOK_SERVICE_ADDRESS = 'http://rimbiskitapsever.appspot.com/book'
#BOOK_SERVICE_ADDRESS = 'http://localhost:8080/book'
4 changes: 2 additions & 2 deletions crawler/crawler/spiders/ideefixe.py
@@ -20,10 +20,10 @@ class IdefixSpider(CrawlSpider):
def parse_item(self, response):
l = XPathItemLoader(item=BookItem(), response=response)
l.add_xpath('name', '//div[@class=\'boxTanimisim\']/div/text()')
l.add_xpath('isbn', '//div[@id=\'tanitimbox\']/text()', u'.*ISBN : ([0-9]+)')
l.add_xpath('isbn', '//div[@id=\'tanitimbox\']/text()', u'.*ISBN : ([0-9X]+)')
l.add_xpath('author', '//div[@class=\'boxTanimVideo\']/a/text()')
l.add_xpath('publisher','//h3[@class=\'boxTanimyayinevi\']/a/b/text()')
l.add_xpath('price', '//b[@class=\'pricerange\']/text()', u'\s*(.*) TL \(KDV Dahil\)')
l.add_xpath('price', '//b[@class=\'pricerange\']/text()', u'\s*([0-9,]*) TL \(KDV Dahil\)')
l.add_value('link', response.url)
l.add_value('store', 2)
return l.load_item()
6 changes: 3 additions & 3 deletions crawler/crawler/spiders/ilknokta.py
@@ -13,14 +13,14 @@ class IlknoktaSpider(CrawlSpider):
start_urls = ['http://www.ilknokta.com/']

rules = (
Rule(SgmlLinkExtractor(allow=(r'/urun/.*', ), unique=True), 'parse_item', follow=True),
Rule(SgmlLinkExtractor(allow=(r'/kitap/.*', ), unique=True), 'parse_item', follow=True),
Rule(SgmlLinkExtractor(allow=(r'/.*', ), unique=True), ),
)

def parse_item(self, response):
l = XPathItemLoader(item=BookItem(), response=response)
l.add_xpath('name', '//font[@class=\'baslikt\']/strong/text()')
l.add_xpath('isbn', '//td/text()', u'.*ISBN: ([0-9\-]+)')
l.add_xpath('name', '//div[@class="divbaslik"]/@title')
l.add_xpath('isbn', '//td/text()', u'.*ISBN: ([0-9\-X]+)')
l.add_xpath('author', '//td[@class=\'yazart\']/a/text()')
l.add_xpath('publisher','//a[@class=\'yayineviU\']/text()')
l.add_xpath('price', '//font[@class=\'fiyat\']/text()', u'([0-9,]+) TL')
4 changes: 2 additions & 2 deletions crawler/crawler/spiders/imge.py
@@ -19,8 +19,8 @@ class ImgeSpider(CrawlSpider):
def parse_item(self, response):
l = XPathItemLoader(item=BookItem(), response=response)
l.add_xpath('name', '//td[@class=\'pageHeading\']/text()')
l.add_xpath('isbn', '//td[@class=\'main\']/text()', u'ISBN: ([0-9]+)')
l.add_xpath('isbn', '//td[@class=\'main\']/p/text()', u'ISBN: ([0-9]+)')
l.add_xpath('isbn', '//td[@class=\'main\']/text()', u'ISBN: ([0-9X]+)')
l.add_xpath('isbn', '//td[@class=\'main\']/p/text()', u'ISBN: ([0-9X]+)')
l.add_xpath('isbn', '//td[@class=\'main\']/p/text()', u'Barkod: ([0-9]+)')
l.add_xpath('author', '//a[contains(@href, "/person.php")]/b/font/text()')
l.add_xpath('publisher','//a[contains(@href, "manufacturers_id=")]/b/font/text()')
4 changes: 2 additions & 2 deletions crawler/crawler/spiders/kitapyurdu.py
@@ -19,10 +19,10 @@ class KitapyurduSpider(CrawlSpider):
def parse_item(self, response):
l = XPathItemLoader(item=BookItem(), response=response)
l.add_xpath('name', '//span[@class=\'kitapismi\']/text()')
l.add_xpath('isbn', '//span[@class=\'normalkucuk\']/text()', u'ISBN:([0-9]+)')
l.add_xpath('isbn', '//span[@class=\'normalkucuk\']/text()', u'ISBN:([0-9X]+)')
l.add_xpath('author', '//span/a[contains(@href, "/yazar/")]/text()')
l.add_xpath('publisher','//span/a[contains(@href, "/yayinevi/")]/text()')
l.add_xpath('price', '//td/text()', u'Kitapyurdu Fiyatı:(.*) TL\.')
l.add_xpath('price', '//td/text()', u'Kitapyurdu Fiyatı:\s([0-9,]*).*')
l.add_value('link', response.url)
l.add_value('store', 3)
return l.load_item()
5 changes: 2 additions & 3 deletions crawler/crawler/spiders/netkitap.py
@@ -19,11 +19,10 @@ class NetkitapSpider(CrawlSpider):
def parse_item(self, response):
l = XPathItemLoader(item=BookItem(), response=response)
l.add_xpath('name', '//h1[@class=\'kitapad14pnt\']/b/text()')
l.add_xpath('isbn', '//span[@class=\'kunye\']/text()', u'ISBN: ([0-9\-]+)')
l.add_xpath('isbn', '//span[@class=\'kunye\']/text()', u'ISBN: ([0-9\-X]+)')
l.add_xpath('author', '//span[@class=\'yazarad12pnt\']/a/span[@class=\'yazarad12pnt\']/text()')
l.add_xpath('publisher','//h3[@class=\'kapakyazisi\']/b/font/a/text()')
l.add_xpath('price', '//span[@class=\'kapakyazisi\']/font/b/text()', u'(.*) TL')
l.add_xpath('price', '//span[@class=\'kapakyazisi\']/b/text()', u'(.*) TL')
l.add_xpath('price', '//span[@class="kapakyazisi"]/font/b/text()', u'(.*) TL')
l.add_value('link', response.url)
l.add_value('store', 5)
return l.load_item()
12 changes: 6 additions & 6 deletions crawler/crawler/spiders/pandora.py
@@ -13,17 +13,17 @@ class PandoraSpider(CrawlSpider):
start_urls = ['http://www.pandora.com.tr/']

rules = (
Rule(SgmlLinkExtractor(allow=(r'/urun\.aspx\?id=',), unique=True), 'parse_item', follow=True),
Rule(SgmlLinkExtractor(allow=(r'/urun/.*',), deny_domains='beyoglu.pandora.com.tr', unique=True), 'parse_item', follow=True),
Rule(SgmlLinkExtractor(allow=(r'/.*', ), unique=True)),
)

def parse_item(self, response):
l = XPathItemLoader(item=BookItem(), response=response)
l.add_xpath('name', '//span[@id=\'ctl00_ContentPlaceHolderMainOrta_LabelAdi\']/text()')
l.add_xpath('isbn', '//span[@id=\'ctl00_ContentPlaceHolderMainOrta_LabelIsbn\']/text()')
l.add_xpath('author', '//span[@id=\'ctl00_ContentPlaceHolderMainOrta_LabelYazar\']/a/text()')
l.add_xpath('publisher','//a[@id=\'ctl00_ContentPlaceHolderMainOrta_HyperLinkYayinci\']/text()')
l.add_xpath('price', '//span[@class=\'fiyat\']/text()', u'(.*) TL')
l.add_xpath('name', '//span[@id="ContentPlaceHolderMainOrta_LabelAdi"]/text()')
l.add_xpath('isbn', '//span[@id="ContentPlaceHolderMainOrta_LabelIsbn"]/text()')
l.add_xpath('author', '//span[@id="ContentPlaceHolderMainOrta_LabelYazar"]/a/text()')
l.add_xpath('publisher','//a[@id="ContentPlaceHolderMainOrta_HyperLinkYayinci"]/text()')
l.add_xpath('price', '//span[@id=\'ContentPlaceHolderMainOrta_LabelFiyat\']/span[@class=\'fiyat\']/text()', u'(.*) TL')
l.add_value('link', response.url)
l.add_value('store', 4)
return l.load_item()
30 changes: 0 additions & 30 deletions crawler/createdb.py

This file was deleted.

68 changes: 68 additions & 0 deletions crawler/run.sh
@@ -0,0 +1,68 @@
#!/bin/sh

crawl_list="./scrapy-ctl.py list"
crawl_exec="./scrapy-ctl.py crawl"
pattern=".(com|net|gen|org)(.tr|)$"

list=0

while getopts "o:lh" optname
do
case "$optname" in
o)
echo "Warning: Only run a site"
site=$OPTARG
;;
h)
cat << EOF
usage:
./run [-o, -h] [site name]
-o: Optional parameter. Indicates that script should run only for a site instance.
If site is provided as parameter, script should exit after finishes its execution.

-l: Optional parameter. Lists the available book sites.

-h: Optional parameter. Prints this message.

[site name]: Optional parameter. Script starts from given site. If -o is provided, 'run' command should exit after executing the site.
If -o is not provided then, 'run' command should continue executing respectively.
example:
./run -o idefix.com
Run idefix.com then exits.

./run -l
List the book sites.

./run
Run all of the book sites.

EOF
exit 0
;;
l)
list=1
;;
:)
echo "HOP $OPTARG"
;;
*)
echo "Unknown error occured"
;;
esac
done

for line in $($crawl_list); do
if [[ $line =~ $pattern ]]; then
if [ -n "$site" ]; then
if [ "$site" != "$line" ]; then
continue;
fi
fi
echo "Book Site: '$line'"
if [ $list -eq 0 ]; then
echo "Crawling started . . ."
$($crawl_exec $line)
echo "Crawling done"
fi
fi
done