Tips for generating skipgrams in Python

A k-skip-n-gram generalizes an n-gram by allowing tokens to be skipped: the set of k-skip-n-grams includes every (k-i)-skip-n-gram down to (k-i) == 0, so it also contains the plain 0-skip n-grams. The question is: how can these skipgrams be calculated efficiently in Python?
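To make that inclusion property concrete, here is a naive enumeration for the bigram case (an illustrative sketch only, not the efficient solution being asked for): a pair of tokens whose positions are at most k+1 apart is a k-skip-bigram, so the 2-skip-bigrams subsume the 1-skip and 0-skip bigrams.

```python
# Naive k-skip-bigram enumeration, for illustration only.
from itertools import combinations

def naive_skip_bigrams(tokens, k):
    # A pair (tokens[i], tokens[j]) is a k-skip-bigram when at most
    # k tokens lie between the two positions, i.e. j - i <= k + 1.
    return [(tokens[i], tokens[j])
            for i, j in combinations(range(len(tokens)), 2)
            if j - i <= k + 1]

words = ['all', 'this', 'happened', 'more', 'or', 'less']
# Every 1-skip bigram is also a 2-skip bigram:
print(set(naive_skip_bigrams(words, 1)) <= set(naive_skip_bigrams(words, 2)))  # True
```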

Below is an attempted snippet that does not produce the desired results:

<pre>
input_list = ['all', 'this', 'happened', 'more', 'or', 'less']

def find_skipgrams(input_list, N, K):
    bigram_list = []
    nlist = []

    K = 1
    for k in range(K + 1):
        for i in range(len(input_list) - 1):
            if i + k + 1 < len(input_list):
                nlist = []
                for j in range(N + 1):
                    if i + k + j + 1 < len(input_list):
                        nlist.append(input_list[i + k + j + 1])

            bigram_list.append(nlist)
    return bigram_list
</pre>

Executing

find_skipgrams(['all', 'this', 'happened', 'more', 'or', 'less'], 2, 1)

produces the following output:

[['this', 'happened', 'more'], ['happened', 'more', 'or'], ['more', 'or', 'less'], ['or', 'less'], ['less'], ['happened', 'more', 'or'], ['more', 'or', 'less'], ['or', 'less'], ['less'], ['less']]

The code at the following link also fails to produce correct results: https://github.com/heaven00/skipgram/blob/master/skipgram.py

Executing skipgram_ndarray("What is your name") returns: ['What,is', 'is,your', 'your,name', 'name,', 'What,your', 'is,name']

It's worth noting that "name" is treated as a unigram!

Answer №1

The paper linked by the original poster uses the following example string:

Insurgents killed in ongoing fighting

This input yields the following results:

2-skip-bi-grams = {insurgents killed, insurgents in, insurgents ongoing, killed in, killed ongoing, killed fighting, in ongoing, in fighting, ongoing fighting}

2-skip-tri-grams = {insurgents killed in, insurgents killed ongoing, insurgents killed fighting, insurgents in ongoing, insurgents in fighting, insurgents ongoing fighting, killed in ongoing, killed in fighting, killed ongoing fighting, in ongoing fighting}.

A modified version of NLTK's ngrams code can be seen here: (https://github.com/nltk/nltk/blob/develop/nltk/util.py#L383)

[...] (omitted for brevity) 

We can now apply some doctests to validate the example presented in the paper:

[...] (doctest examples and outputs) 

Note that when n+k > len(sequence), the result is the same as running skipgrams(sequence, n, k-1). For instance:

[...] (example output when conditions are met) 

The code requires n >= k, raising an exception when n < k, as this check shows:

if n < k:
    raise Exception("Degree of Ngrams (n) needs to be bigger than skip (k)")

To further explain the function within the code:

[...] (explanation and demonstration of itertools combinations) 

By mapping these indices back to the list of tokens, we can generate skipgrams for the current token and its context+skip window:

[...] (demonstration of generating skipgrams based on token indices) 

This process is repeated for each word in the sequence.
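The per-token procedure described above can be sketched directly (this is my own minimal re-implementation of the idea, not the NLTK code itself): for each head token, itertools.combinations picks the remaining n-1 tokens, in order, from the following n-1+k positions.

```python
from itertools import combinations

def skipgrams_sketch(tokens, n, k):
    """Minimal sketch of k-skip-n-gram extraction via index combinations."""
    results = []
    for i in range(len(tokens)):
        # The context window for this head token: the next n-1+k tokens.
        window = tokens[i + 1 : i + n + k]
        # Choose the n-1 companion tokens; combinations preserves order.
        for tail in combinations(window, n - 1):
            results.append((tokens[i],) + tail)
    return results

sent = "Insurgents killed in ongoing fighting".split()
print(len(skipgrams_sketch(sent, 2, 2)))  # 9, matching the paper's 2-skip-bi-grams
print(len(skipgrams_sketch(sent, 3, 2)))  # 10, matching the 2-skip-tri-grams
```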

Answer №2

UPDATED

NLTK version 3.2.5 now includes a skipgrams function.

Check out a cleaner implementation by @jnothman on the NLTK repository: https://github.com/nltk/nltk/blob/develop/nltk/util.py#L538

from itertools import combinations

from nltk.util import ngrams, pad_sequence

def skipgrams(sequence, n, k, **kwargs):
    """
    Returns all possible skipgrams generated from a sequence, as an iterator.
    Skipgrams are n-grams that allow tokens to be skipped.
    Reference: http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf

    :param sequence: the input data used to generate skipgrams
    :param n: the degree of the n-grams
    :param k: the skip distance between tokens
    :rtype: iterator yielding tuples
    """
    # Pad the sequence if requested via kwargs.
    if 'pad_left' in kwargs or 'pad_right' in kwargs:
        sequence = pad_sequence(sequence, n, **kwargs)

    SENTINEL = object()  # Sentinel marking padding past the end of the sequence.
    for ngram in ngrams(sequence, n + k, pad_right=True, right_pad_symbol=SENTINEL):
        head = ngram[:1]
        tail = ngram[1:]
        for skip_tail in combinations(tail, n - 1):
            if skip_tail[-1] is SENTINEL:
                continue
            yield head + skip_tail

[out]:

>>> from nltk.util import skipgrams
>>> sent = "Insurgents killed in ongoing fighting".split()
>>> list(skipgrams(sent, 2, 2))
[('Insurgents', 'killed'), ('Insurgents', 'in'), ('Insurgents', 'ongoing'), ('killed', 'in'), ('killed', 'ongoing'), ('killed', 'fighting'), ('in', 'ongoing'), ('in', 'fighting'), ('ongoing', 'fighting')]
>>> list(skipgrams(sent, 3, 2))
[('Insurgents', 'killed', 'in'), ('Insurgents', 'killed', 'ongoing'), ('Insurgents', 'killed', 'fighting'), ('Insurgents', 'in', 'ongoing'), ('Insurgents', 'in', 'fighting'), ('Insurgents', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing'), ('killed', 'in', 'fighting'), ('killed', 'ongoing', 'fighting'), ('in', 'ongoing', 'fighting')]

Answer №3

Taking a different path from your current code and opting for an external library, you can use Colibri Core for efficient skipgram extraction. This library is designed specifically for extracting n-grams and skipgrams quickly from large text corpora. The core is written in C++ for performance, and a Python binding is available.

You rightly highlight the importance of efficiency: skipgram extraction can exhibit exponential complexity. That may not matter when processing a single sentence like your input_list, but it becomes a problem on a large corpus. Colibri Core mitigates this with parameters such as an occurrence threshold, or by requiring each skip to be fillable by at least x distinct n-grams.

import colibricore

#Prepare the corpus data (will be encoded for efficiency)
corpusfile_plaintext = "somecorpus.txt" #input file with one sentence per line
encoder = colibricore.ClassEncoder()
encoder.build(corpusfile_plaintext)
corpusfile = "somecorpus.colibri.dat" #output corpus file
classfile = "somecorpus.colibri.cls" #output of class encoding
encoder.encodefile(corpusfile_plaintext,corpusfile)
encoder.save(classfile)

#Specify options for skipgram extraction (mintokens defines the occurrence threshold, maxlength sets the maximum ngram/skipgram length)
options = colibricore.PatternModelOptions(mintokens=2,maxlength=8,doskipgrams=True) 

#Create an empty pattern model 
model = colibricore.UnindexedPatternModel()

#Train the model using the encoded corpus file (this triggers skipgram extraction)
model.train(corpusfile, options)

#Load a decoder to view the output
decoder = colibricore.ClassDecoder(classfile)

#Display all skipgrams
for pattern in model:
    if pattern.category() == colibricore.Category.SKIPGRAM:
        print(pattern.tostring(decoder))

A more detailed Python tutorial covering these processes is available on the website.

Disclaimer: I am the developer of Colibri Core

Answer №4

For a comprehensive understanding, please check out this link.

The example below shows how the function is used:

>>> sent = "Insurgents killed in ongoing fighting".split()
>>> list(skipgrams(sent, 2, 2))
[('Insurgents', 'killed'), ('Insurgents', 'in'), ('Insurgents', 'ongoing'), ('killed', 'in'), ('killed', 'ongoing'), ('killed', 'fighting'), ('in', 'ongoing'), ('in', 'fighting'), ('ongoing', 'fighting')]

Answer №5

Rather than writing your own, you could use the existing implementation at https://github.com/heaven00/skipgram/blob/master/skipgram.py, where k = skip_size and n = ngram_order:

import numpy as np

def skipgram_ndarray(sent, k=1, n=2):
    """
    This is not exactly a vectorized version, because we are still
    using a for loop.
    """
    tokens = sent.split()
    if len(tokens) < k + 2:
        raise Exception("REQ: length of sentence > skip + 2")
    matrix = np.zeros((len(tokens), k + 2), dtype=object)
    matrix[:, 0] = tokens
    matrix[:, 1] = tokens[1:] + ['']
    result = []
    for skip in range(1, k + 1):
        matrix[:, skip + 1] = tokens[skip + 1:] + [''] * (skip + 1)
    for index in range(1, k + 2):
        temp = matrix[:, 0] + ',' + matrix[:, index]
        # In Python 3, map() is lazy, so the original
        # `map(result.append, temp.tolist())` never executed; extend instead.
        result.extend(temp.tolist())
    # Integer division so the slice index is an int under Python 3.
    limit = (((k + 1) * (k + 2)) // 6) * ((3 * n) - (2 * k) - 6)
    return result[:limit]

# Example: skipgram_ndarray("What is your name", k=1, n=2) returns
# ['What,is', 'is,your', 'your,name', 'name,', 'What,your', 'is,name']

def skipgram_list(sent, k=1, n=2):
    """
    Form skipgram features using list comprehensions.
    """
    tokens = sent.split()
    tokens_n = ['tokens[index + j + {0}]'.format(index)
                for index in range(n - 1)]
    x = '(tokens[index], ' + ', '.join(tokens_n) + ')'
    query_part1 = 'result = [' + x + ' for index in range(len(tokens))'
    query_part2 = ' for j in range(1, k+2) if index + j + n < len(tokens)]'
    # exec() cannot reliably create function locals in Python 3, so run the
    # generated comprehension in an explicit namespace instead.
    namespace = {'tokens': tokens, 'k': k, 'n': n}
    exec(query_part1 + query_part2, namespace)
    return namespace['result']
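For what it's worth, the exec trick in skipgram_list is not necessary: the comprehension it builds as a string only varies the gap before the first companion token (the remaining n-2 companions follow contiguously), so it can be written directly. The following is a sketch of that equivalent logic, not code from the linked repository:

```python
def skipgram_list_direct(sent, k=1, n=2):
    # Direct version of the comprehension that skipgram_list assembles
    # as a string: only the gap j before the first companion varies.
    tokens = sent.split()
    return [tuple([tokens[i]] + [tokens[i + j + m] for m in range(n - 1)])
            for i in range(len(tokens))
            for j in range(1, k + 2)
            if i + j + n < len(tokens)]

print(skipgram_list_direct("What is your name"))  # [('What', 'is')]
```

Note that, like the original, this only produces grams where tokens after the first companion are contiguous, so it is not a full skipgram enumeration.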
