lingu log

Stop-word removal using NLTK

Stop-word removal is a crucial task in information retrieval. Most indexing mechanisms include this step as an essential part, since it improves the performance of the information retrieval process. Stop-words are mostly very frequent words, like which, that, and a, that carry no distinct meaning. Luckily, NLTK ships with a list of stop-words that can be used for filtering them out. This corpus can be downloaded via nltk.download('stopwords'); the stop_word_removal function also has a data_download parameter that you can set to True or False to trigger the download. Sample output from a sample Craigslist dataset:

'water', 'bottle', 'never', 'added', 'bike.', 'call', 'interested', 'meet', 'location', 'people', 'like'

Github gist:
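A minimal sketch of what such a filter can look like, assuming the stop_word_removal signature and data_download parameter mentioned above:

import nltk
from nltk.corpus import stopwords

def stop_word_removal(tokens, data_download=False):
    # Optionally fetch the stop-word corpus on first use
    if data_download:
        nltk.download('stopwords')
    stops = set(stopwords.words('english'))
    # Keep only the tokens that are not stop-words
    return [t for t in tokens if t.lower() not in stops]

print(stop_word_removal("never added a water bottle to the bike".split(), data_download=True))
# ['never', 'added', 'water', 'bottle', 'bike']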

Removing Proper Nouns

Another crucial method for corpus pre-processing is to remove words with certain POS tags; in this Github gist we remove tokens tagged NNP and NNPS (proper nouns):
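A minimal sketch of the same idea with NLTK's tokenizer and tagger (the function name is illustrative, not the gist's actual code):

import nltk
from nltk import pos_tag, word_tokenize

# One-time downloads: nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')

def remove_proper_nouns(text):
    # Tag every token and drop the ones tagged as proper nouns
    tagged = pos_tag(word_tokenize(text))
    return [word for word, tag in tagged if tag not in ("NNP", "NNPS")]

print(remove_proper_nouns("John sells bikes in Annapolis"))  # the proper nouns are dropped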

Removing OOVs

One of the main tasks when building a corpus is to remove out-of-vocabulary words. These words are mostly proper nouns and occur very infrequently. This can be done with a Python Counter. The default frequency threshold I have set in the code is 100. The words occurring 1000 times or more in the sample Craigslist dataset are:

{'': 26506, '\n': 1451, 'to': 2304, 'a': 1313, 'the': 1649, 'and': 2496, 'for': 1032, 'of': 1146}

Github gist:
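A minimal sketch of the Counter-based filter, assuming the threshold described above:

from collections import Counter

def remove_infrequent_words(tokens, threshold=100):
    # Count every token in the corpus, then drop the rare ones
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] >= threshold]

# To inspect the high end instead, e.g. words occurring 1000 times or more:
# {w: c for w, c in Counter(tokens).items() if c >= 1000}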

Scraping 101

In this post we will go through the basics of scraping a website and getting data out of it, using Scrapy. The library either uses a crawl spider to go through the pages of a website or, given a list of web addresses, crawls those pages. Scrapy supports both XPath and CSS selectors, which means that with a simple CSS/XPath selector we can capture the required data from a webpage. Instructions on using Scrapy follow below.

Link to Github repo: https://github.com/adelra/scraping_101

The basic idea is to start by generating a spider using scrapy genspider mydomain mydomain.com. Then we import from scrapy.spiders import CrawlSpider, Rule, where the former is the base crawler used in Scrapy and the latter is its rule system; we will talk about that later. Then we define our items. Items are basically the parts of a website we want to crawl. For instance, if we want to extract contact names, phone numbers, and addresses from a website, we have to:

import scrapy

class CrawlerName(scrapy.Item):
    name = scrapy.Field()
    phone = scrapy.Field()
    address = scrapy.Field()

We will then create empty files to write out our output. In cases where there is a pipeline for our data, for instance when the scrapy crawler is called from another shell, we can write the data to stdout instead.

Since Wikipedia asks, on its own website, that you not crawl it even for good and/or testing purposes, I will limit the start URLs to a small number of pages.
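Putting these pieces together, a minimal CrawlSpider skeleton could look like the sketch below; the spider name, start URL, and selector are illustrative assumptions, not the repo's actual code:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class GeneralSpider(CrawlSpider):
    name = "general"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["https://en.wikipedia.org/wiki/Web_scraping"]  # deliberately kept small
    rules = (
        # Follow /wiki/ links and hand every fetched page to parse_item
        Rule(LinkExtractor(allow=r"/wiki/"), callback="parse_item", follow=False),
    )

    def parse_item(self, response):
        item = CrawlerName()  # the Item class defined earlier
        item["name"] = response.css("h1::text").get()
        yield item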

We then have to include the allowed URLs in a list to tell Scrapy which URLs and domains we want to crawl. The last step is to pass each item through a CSS/XPath selector. I personally would rather use a regex to clean the data before serializing it to disk. For instance:

item["phone"] = re.sub(r"[a-zA-Z<>\\\"=/:-%!_]*","","".join(response.css(r".EFBIF").extract()).encode('utf-8'))

The regular expression [a-zA-Z<>\\\"=/:%!_]* essentially removes most characters but keeps the digits and the separators (spaces, dots, hyphens, plus signs) that phone numbers are written with. A quick guide to the regex syntax is given below:

  • a-z: lowercase letters a through z
  • A-Z: uppercase letters A through Z
  • 0-9: digits 0 through 9
  • *: zero or more repetitions of the preceding pattern
  • []: a character class; matches any single character listed inside it

Now, the ideal way to remove HTML tags from our crawled web pages would be <[^>]*>. Also, in the example above, the re.sub method takes three arguments. Basically, the first one is “What to find?”, the second one is “What to replace it with?”, and the third one is “Where to look?”. As a sidenote, the r before the regex string tells Python to handle it as a raw string. In the end, we will either write out our data with fp.write() or return it from our function with return data_from_crawler. To run the crawler, you have to point to the folder /a/scrapy/dataset/dataset and enter scrapy crawl general, or the name of a specific crawler, on the command line. I have prepared two crawlers: one for Craigslist and the other for Wikipedia. You can run either of them.
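For illustration, here is that tag-stripping pattern applied to a made-up snippet of HTML:

import re

html = '<a href="tel:555-555-1234">call 555-555-1234</a>'
print(re.sub(r"<[^>]*>", "", html))  # -> call 555-555-1234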

Alternatively, I have included a simple bash file that takes care of the paths and runs the crawler. You can simply run easy-run.sh to start the Wikipedia crawler.

Regex for Extracting Phone Numbers (Craigslist dataset)

For this part, I have prepared a function that takes a text file on disk as input and prints out all the phone numbers found in the file. The regex for finding phone numbers is [+|00| ][0-9.*_-]{10,}
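A minimal sketch of such a function, reusing the regex above (the function name and report format are assumptions):

import re

PHONE_RE = re.compile(r"[+|00| ][0-9.*_-]{10,}")

def print_phone_numbers(path):
    # Read the whole file and report every substring matching the phone regex
    with open(path) as fp:
        numbers = PHONE_RE.findall(fp.read())
    print("Number of phone numbers collected:", len(numbers))
    print("list of phone numbers:", numbers)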

The sample output of extracting phone numbers from Craigslist is given below:

Number of phone numbers collected: 126
list of phone numbers: [' 0060164791359', ' 555-555-1234', ' 540-999-8042', ' 800-220-1177', '02-276-9019', ' 240-505-8508', ' 410-268-6627', ' 703-761-6161', ' 443-736-0907', ' 443-736-0907', ' 443-736-0907', ' 703-577-8070', ' 540-718-5696', '0-423-2114***', ' 904-248-0055', ' 407-252-5605', '00-476-1360***', '07-900-7736', ' 407-319-9973.', ' 407-319-9973', ' 912-389-2792', ' 478-308-1559', ' 407.283.5296', ' 813.728.6120', ' 813.728.6120', ' 727-264-6560', '+91-9620111613', '02-345-5435', ' 601-549-1224', ' 479-616-2034', ' 601-549-1224', ' 479-616-2034', ' 601-549-1224', ' 601-549-1224', ' 479-616-2034', ' 601-549-1224', ' 479-616-2034', ' 601-549-1224', ' 479-616-2034', ' 601-549-1224', ' 601-549-1224', ' 479-616-2034', ' 479-402-6755', ' 479-616-2034', ' 479-402-6755', ' 479-616-2034', ' 479-402-6755', ' 479-402-6755', ' 479-616-2034', ' 479-402-6755', ' 479-616-2034', ' 479-402-6755', ' 479-616-2034', ' 479-402-6755', ' 479-402-6755', ' 479-616-2034', ' 203-685-0346', ' 203-685-0346', ' 850-970-00', ' 6782789774', ' 6782789774', ' 706-499-4849', ' 912-257-3683', ' 4046717775', ' 6782789774', ' 6782789774', ' 850-543-1914', ' 850-543-1914', ' 1-800-273-6103', ' 1-800-273-6103', ' 1-800-273-6103', ' 504-559-6612', '0062-70065-70121-70123-70001-70002-70003-70005-70006-', '0112.70113.70114.70115.70116.70117.', '0118.70119.70112.70122.70124.70125.', '0126.70130.', ' ------------', ' ************', ' 25-626-1637.', ' 10000-20000', ' 11-2-18.105', ' ..........', ' 856-776-31-5', ' 609-829-8873', ' 1-800-505-5423', ' 1-800-505-5423', ' 1-800-505-5423', ' 1-800-505-5423', ' 8562137860', ' 609-790-9582', ' 1-800-505-5423', ' 1-800-505-5423', ' 1-800-505-5423', ' 1-800-505-5423', ' 215-800-7515', ' 856-881-3663', ' 760-729-1877', ' 216-703-4719', ' 216-534-7017', ' 619-233-8300', ' 916-448-6715', ' 888-888-7122', ' 888-888-7122', ' 888-888-7122', ' 760-729-1877', ' 1-866-435-0844', ' 1-866-435-0844', ' 760-230-6635', ' 818-207-5290', ' 951-742-9079.', ' 951-742-9079', ' 951-742-9079.', ' 951-742-9079.', ' 310-704-6446', ' 323-677-5102', ' 323-677-5102', '00-920-0016', ' 818-207-5290', ' 818-207-5290', ' ________________________________________________________________', '09-938-1212', ' 951-320-1390', ' 951-320-1390', ' 1-866-879-2829', ' 213-399-6303', ' 323-220-6248']

Language Identification

Language identification is the task of determining the language of a given string. There are numerous methods for language identification, but in general the task can be framed as sequence classification, since we assign a label, the language, to a whole sequence. For instance, a method for language identification using neural networks can be seen at github.com/adelra/langident

Considering that training a neural network, and on top of that preparing a dataset for it to train on, is time-consuming, we will stick to existing methods.

Langdetect: Langdetect is a very lightweight and fast library that detects languages almost instantly.

>>> from langdetect import detect
>>> detect("I am bigger than you")
'en'
>>> detect("Bonjour, mademoiselle")
'fr'

The results from the language detector on the extracted Craigslist dataset are as follows:

{'fr': 155, 'en': 5179, 'no': 65, 'et': 28, 'id': 52, 'af': 71, 'de': 321, 'da': 73, 'vi': 53, 'ca': 99, 'nl': 91, 'tl': 44, 'sv': 32, 'it': 118, 'so': 43, 'sl': 6, 'hu': 25, 'pt': 47, 'cy': 32, 'hr': 3, 'es': 67, 'sw': 7, 'ro': 64, 'fi': 6, 'pl': 19, 'tr': 17, 'lt': 9, 'lv': 8, 'sq': 5, 'zh-cn': 12, 'cs': 2, 'sk': 6}

Github gist:
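A minimal sketch of how such a per-language tally could be produced with langdetect and a Counter (the function name is an assumption):

from collections import Counter
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def language_distribution(lines):
    # Detect each line's language and count how often each one appears
    counts = Counter()
    for line in lines:
        try:
            counts[detect(line)] += 1
        except LangDetectException:
            continue  # line too short or ambiguous to detect
    return dict(counts)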