Tips for generating skipgrams in Python

A k-skip-n-gram generalizes an n-gram by allowing tokens to be skipped: the set of k-skip-n-grams includes every (k-i)-skip-n-gram down to (k-i) == 0, so it also contains the plain 0-skip n-grams. The question is: how can these skipgrams be calculated efficiently in Python?
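To make that inclusion property concrete, here is a naive enumeration for the bigram case (an illustrative sketch only, not the efficient solution being asked for): a pair of tokens whose positions are at most k+1 apart is a k-skip-bigram, so the 2-skip-bigrams subsume the 1-skip and 0-skip bigrams.

```python
# Naive k-skip-bigram enumeration, for illustration only.
from itertools import combinations

def naive_skip_bigrams(tokens, k):
    # A pair (tokens[i], tokens[j]) is a k-skip-bigram when at most
    # k tokens lie between the two positions, i.e. j - i <= k + 1.
    return [(tokens[i], tokens[j])
            for i, j in combinations(range(len(tokens)), 2)
            if j - i <= k + 1]

words = ['all', 'this', 'happened', 'more', 'or', 'less']
# Every 1-skip bigram is also a 2-skip bigram:
print(set(naive_skip_bigrams(words, 1)) <= set(naive_skip_bigrams(words, 2)))  # True
```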

Below is an attempted snippet that does not produce the desired results:

<pre>
input_list = ['all', 'this', 'happened', 'more', 'or', 'less']

def find_skipgrams(input_list, N, K):
    bigram_list = []
    nlist = []

    K = 1
    for k in range(K + 1):
        for i in range(len(input_list) - 1):
            if i + k + 1 < len(input_list):
                nlist = []
                for j in range(N + 1):
                    if i + k + j + 1 < len(input_list):
                        nlist.append(input_list[i + k + j + 1])

            bigram_list.append(nlist)
    return bigram_list
</pre>

Executing

find_skipgrams(['all', 'this', 'happened', 'more', 'or', 'less'], 2, 1)

produces the following output:

[['this', 'happened', 'more'], ['happened', 'more', 'or'], ['more', 'or', 'less'], ['or', 'less'], ['less'], ['happened', 'more', 'or'], ['more', 'or', 'less'], ['or', 'less'], ['less'], ['less']]

The code at the following link also fails to produce correct results: https://github.com/heaven00/skipgram/blob/master/skipgram.py

Executing skipgram_ndarray("What is your name") returns: ['What,is', 'is,your', 'your,name', 'name,', 'What,your', 'is,name']

It's worth noting that "name" is treated as a unigram!

Answer №1

The paper linked by the original poster uses the following example string:

Insurgents killed in ongoing fighting

This input yields the following results:

2-skip-bi-grams = {insurgents killed, insurgents in, insurgents ongoing, killed in, killed ongoing, killed fighting, in ongoing, in fighting, ongoing fighting}

2-skip-tri-grams = {insurgents killed in, insurgents killed ongoing, insurgents killed fighting, insurgents in ongoing, insurgents in fighting, insurgents ongoing fighting, killed in ongoing, killed in fighting, killed ongoing fighting, in ongoing fighting}.

A modified version of NLTK's ngrams code can be seen here: (https://github.com/nltk/nltk/blob/develop/nltk/util.py#L383)

[...] (omitted for brevity) 

We can now apply some doctests to validate the example presented in the paper:

[...] (doctest examples and outputs) 

Note that when n+k > len(sequence), the result is the same as running skipgrams(sequence, n, k-1). For instance:

[...] (example output when conditions are met) 

The code requires n >= k, raising an exception when n < k, as this check shows:

if n < k:
    raise Exception("Degree of Ngrams (n) needs to be bigger than skip (k)")

To further explain the function within the code:

[...] (explanation and demonstration of itertools combinations) 

By mapping these indices back to the list of tokens, we can generate skipgrams for the current token and its context+skip window:

[...] (demonstration of generating skipgrams based on token indices) 

This process is repeated for each word in the sequence.
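The per-token procedure described above can be sketched directly (this is my own minimal re-implementation of the idea, not the NLTK code itself): for each head token, itertools.combinations picks the remaining n-1 tokens, in order, from the following n-1+k positions.

```python
from itertools import combinations

def skipgrams_sketch(tokens, n, k):
    """Minimal sketch of k-skip-n-gram extraction via index combinations."""
    results = []
    for i in range(len(tokens)):
        # The context window for this head token: the next n-1+k tokens.
        window = tokens[i + 1 : i + n + k]
        # Choose the n-1 companion tokens; combinations preserves order.
        for tail in combinations(window, n - 1):
            results.append((tokens[i],) + tail)
    return results

sent = "Insurgents killed in ongoing fighting".split()
print(len(skipgrams_sketch(sent, 2, 2)))  # 9, matching the paper's 2-skip-bi-grams
print(len(skipgrams_sketch(sent, 3, 2)))  # 10, matching the 2-skip-tri-grams
```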

Answer №2

UPDATED

NLTK version 3.2.5 now includes a skipgrams function.

Check out a cleaner implementation by @jnothman on the NLTK repository: https://github.com/nltk/nltk/blob/develop/nltk/util.py#L538

from itertools import combinations

from nltk.util import ngrams, pad_sequence

def skipgrams(sequence, n, k, **kwargs):
    """
    Returns all possible skipgrams generated from a sequence, as an iterator.
    Skipgrams are n-grams that allow tokens to be skipped.
    Reference: http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf

    :param sequence: the input data used to generate skipgrams
    :param n: the degree of the n-grams
    :param k: the skip distance between tokens
    :rtype: iterator yielding tuples
    """
    # Pad the sequence if requested via kwargs.
    if 'pad_left' in kwargs or 'pad_right' in kwargs:
        sequence = pad_sequence(sequence, n, **kwargs)

    SENTINEL = object()  # Sentinel marking padding past the end of the sequence.
    for ngram in ngrams(sequence, n + k, pad_right=True, right_pad_symbol=SENTINEL):
        head = ngram[:1]
        tail = ngram[1:]
        for skip_tail in combinations(tail, n - 1):
            if skip_tail[-1] is SENTINEL:
                continue
            yield head + skip_tail

[out]:

>>> from nltk.util import skipgrams
>>> sent = "Insurgents killed in ongoing fighting".split()
>>> list(skipgrams(sent, 2, 2))
[('Insurgents', 'killed'), ('Insurgents', 'in'), ('Insurgents', 'ongoing'), ('killed', 'in'), ('killed', 'ongoing'), ('killed', 'fighting'), ('in', 'ongoing'), ('in', 'fighting'), ('ongoing', 'fighting')]
>>> list(skipgrams(sent, 3, 2))
[('Insurgents', 'killed', 'in'), ('Insurgents', 'killed', 'ongoing'), ('Insurgents', 'killed', 'fighting'), ('Insurgents', 'in', 'ongoing'), ('Insurgents', 'in', 'fighting'), ('Insurgents', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing'), ('killed', 'in', 'fighting'), ('killed', 'ongoing', 'fighting'), ('in', 'ongoing', 'fighting')]

Answer №3

Taking a different path from your current code and opting for an external library, you can use Colibri Core for efficient skipgram extraction. This library is designed specifically for extracting n-grams and skipgrams quickly from large text corpora. The core is written in C++ for performance, and a Python binding is available.

You rightly highlight the importance of efficiency: skipgram extraction can exhibit exponential complexity. That may not matter when processing a single sentence like your input_list, but it becomes a problem on a large corpus. Colibri Core mitigates this with parameters such as an occurrence threshold, or by requiring each skip to be fillable by at least x distinct n-grams.

import colibricore

#Prepare the corpus data (will be encoded for efficiency)
corpusfile_plaintext = "somecorpus.txt" #input file with one sentence per line
encoder = colibricore.ClassEncoder()
encoder.build(corpusfile_plaintext)
corpusfile = "somecorpus.colibri.dat" #output corpus file
classfile = "somecorpus.colibri.cls" #output of class encoding
encoder.encodefile(corpusfile_plaintext,corpusfile)
encoder.save(classfile)

#Specify options for skipgram extraction (mintokens defines the occurrence threshold, maxlength sets the maximum ngram/skipgram length)
options = colibricore.PatternModelOptions(mintokens=2,maxlength=8,doskipgrams=True) 

#Create an empty pattern model 
model = colibricore.UnindexedPatternModel()

#Train the model using the encoded corpus file (this triggers skipgram extraction)
model.train(corpusfile, options)

#Load a decoder to view the output
decoder = colibricore.ClassDecoder(classfile)

#Display all skipgrams
for pattern in model:
    if pattern.category() == colibricore.Category.SKIPGRAM:
        print(pattern.tostring(decoder))

A more detailed Python tutorial covering these processes is available on the website.

Disclaimer: I am the developer of Colibri Core

Answer №4

For a comprehensive understanding, please check out this link.

The example below shows how the function is used:

>>> sent = "Insurgents killed in ongoing fighting".split()
>>> list(skipgrams(sent, 2, 2))
[('Insurgents', 'killed'), ('Insurgents', 'in'), ('Insurgents', 'ongoing'), ('killed', 'in'), ('killed', 'ongoing'), ('killed', 'fighting'), ('in', 'ongoing'), ('in', 'fighting'), ('ongoing', 'fighting')]

Answer №5

Rather than writing your own, you could use the existing implementation at https://github.com/heaven00/skipgram/blob/master/skipgram.py, where k = skip_size and n = ngram_order:

import numpy as np

def skipgram_ndarray(sent, k=1, n=2):
    """
    This is not exactly a vectorized version, because we are still
    using a for loop.
    """
    tokens = sent.split()
    if len(tokens) < k + 2:
        raise Exception("REQ: length of sentence > skip + 2")
    matrix = np.zeros((len(tokens), k + 2), dtype=object)
    matrix[:, 0] = tokens
    matrix[:, 1] = tokens[1:] + ['']
    result = []
    for skip in range(1, k + 1):
        matrix[:, skip + 1] = tokens[skip + 1:] + [''] * (skip + 1)
    for index in range(1, k + 2):
        temp = matrix[:, 0] + ',' + matrix[:, index]
        # In Python 3, map() is lazy, so the original
        # `map(result.append, temp.tolist())` never executed; extend instead.
        result.extend(temp.tolist())
    # Integer division so the slice index is an int under Python 3.
    limit = (((k + 1) * (k + 2)) // 6) * ((3 * n) - (2 * k) - 6)
    return result[:limit]

# Example: skipgram_ndarray("What is your name", k=1, n=2) returns
# ['What,is', 'is,your', 'your,name', 'name,', 'What,your', 'is,name']

def skipgram_list(sent, k=1, n=2):
    """
    Form skipgram features using list comprehensions.
    """
    tokens = sent.split()
    tokens_n = ['tokens[index + j + {0}]'.format(index)
                for index in range(n - 1)]
    x = '(tokens[index], ' + ', '.join(tokens_n) + ')'
    query_part1 = 'result = [' + x + ' for index in range(len(tokens))'
    query_part2 = ' for j in range(1, k+2) if index + j + n < len(tokens)]'
    # exec() cannot reliably create function locals in Python 3, so run the
    # generated comprehension in an explicit namespace instead.
    namespace = {'tokens': tokens, 'k': k, 'n': n}
    exec(query_part1 + query_part2, namespace)
    return namespace['result']
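For what it's worth, the exec trick in skipgram_list is not necessary: the comprehension it builds as a string only varies the gap before the first companion token (the remaining n-2 companions follow contiguously), so it can be written directly. The following is a sketch of that equivalent logic, not code from the linked repository:

```python
def skipgram_list_direct(sent, k=1, n=2):
    # Direct version of the comprehension that skipgram_list assembles
    # as a string: only the gap j before the first companion varies.
    tokens = sent.split()
    return [tuple([tokens[i]] + [tokens[i + j + m] for m in range(n - 1)])
            for i in range(len(tokens))
            for j in range(1, k + 2)
            if i + j + n < len(tokens)]

print(skipgram_list_direct("What is your name"))  # [('What', 'is')]
```

Note that, like the original, this only produces grams where tokens after the first companion are contiguous, so it is not a full skipgram enumeration.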
