What is the best way to annotate specific portions of the cumulative total in matplotlib?

I am working on creating a basic histogram using matplotlib in Python.

The histogram will display the distribution of comment lengths based on several thousand comments. Here is the code I have so far:

import matplotlib.pyplot as plt

x = [60, 55, 2, 30, ..., 190]

plt.hist(x, bins=100)
plt.xlim(0,150)
plt.grid(axis="x")
plt.title("Distribution of Comment Lengths")
plt.xlabel("Tokens/Comment")
plt.ylabel("Amount of Comments")
plt.show()

One thing I would like to add is a visual indicator that shows when 50% (or other percentages) of all tokens have been accounted for in the distribution. For example, a vertical line separating the data into two halves with an equal amount of tokens on each side.

Is there an easy way to accomplish this using matplotlib?

Thank you for your assistance!

Answer №1

To find the x-value below which p% of all comments fall, sort the list of values and take the element at the index that is p% of the list's length. Drawing vertical lines at these positions and adding a second x-axis labeled with the percentages makes this visible in the plot.

To find the x-value below which p% of all tokens fall, compute the cumulative sum of the sorted list, find the first position where that cumulative sum reaches p% of the total sum of all the x's, and use that position to index the sorted list of values.
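As a small worked illustration of the difference between the two splits, here is a sketch on a made-up list of five comment lengths (the numbers are hypothetical and chosen only to keep the arithmetic easy to follow):

```python
import numpy as np

# Hypothetical token counts for five comments (not from the question).
lengths = np.array([10, 1, 4, 2, 3])

sx = np.sort(lengths)   # [1, 2, 3, 4, 10]
cx = sx.cumsum()        # [1, 3, 6, 10, 20]

p = 50
# Length at the p% position of the sorted comments.
comment_split = sx[int(len(sx) * p / 100)]
# Length at which the running token total first reaches p% of all tokens.
token_split = sx[cx.searchsorted(cx[-1] * p / 100)]

print(comment_split, token_split)
```

With this data the comment split lands at length 3, while the token split lands at length 4, because the single long comment holds half of all tokens on its own.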

Below is an example code snippet illustrating how this concept could be implemented:

from matplotlib import pyplot as plt
import numpy as np

# Generate random data for testing purposes, convert to a standard Python list for consistency with the original question
x = list(np.abs(np.random.normal(85, 30, 2000)))
wanted_percentiles = [5, 10, 25, 33, 50, 66, 75, 90, 95]
sx = np.array(x)
sx.sort()
cx = sx.cumsum()

percentile_sx = [sx[int(len(x) * p / 100)] for p in wanted_percentiles]
percentile_cx = [sx[cx.searchsorted(cx[-1] * p / 100)] for p in wanted_percentiles]

fig, axes = plt.subplots(ncols=2, figsize=(12, 4))
for ax, percentile, color, title in zip(axes, [percentile_sx, percentile_cx],
                                 ['crimson', 'limegreen'], ['Comments Percentile', 'Tokens Percentile']):
    ax.hist(x, bins=20)
    for xp in percentile:
        ax.axvline(xp, color=color)
    ax2 = ax.twiny()

    ax.set_xlim(0, 150)
    ax2.set_xlim(ax.get_xlim())  # ensure both axes have identical limits
    ax2.set_xticks(percentile)  # assign xs corresponding to percentiles as tick positions
    ax2.set_xticklabels(wanted_percentiles, color=color) # use percentiles for labeling ticks
    ax.set_title("Distribution of Comment Lengths, " + title)
    ax.set_xlabel("Comments binned by number of tokens")
    ax.set_ylabel("Number of Comments")
plt.show()

The code above draws both panels with 20 bins. The left panel marks the percentiles of comments (crimson) and the right panel marks the percentiles of tokens (limegreen):

https://i.stack.imgur.com/W2qth.png
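For the single histogram from the question, the same idea can be reduced to one helper and one `axvline` call. The following is only a sketch: `token_share_positions` is a hypothetical helper name, and the random data stands in for the real comment lengths.

```python
import numpy as np
import matplotlib.pyplot as plt


def token_share_positions(values, percents):
    """For each p in percents, return the comment length at which the
    shortest comments together hold p% of all tokens."""
    sx = np.sort(np.asarray(values, dtype=float))
    cx = sx.cumsum()
    idx = cx.searchsorted(cx[-1] * np.asarray(percents) / 100)
    # Clip so that p=100 can never index past the end of the array.
    idx = np.minimum(idx, len(sx) - 1)
    return sx[idx]


# Synthetic stand-in for the real comment lengths (assumption).
rng = np.random.default_rng(0)
x = np.abs(rng.normal(85, 30, 2000))

percents = [50]
positions = token_share_positions(x, percents)

fig, ax = plt.subplots()
ax.hist(x, bins=100)
ax.set_xlim(0, 150)
for p, xp in zip(percents, positions):
    ax.axvline(xp, color="crimson", linestyle="--", label=f"{p}% of tokens")
ax.set_title("Distribution of Comment Lengths")
ax.set_xlabel("Tokens/Comment")
ax.set_ylabel("Amount of Comments")
ax.legend()
```

Call `plt.show()` or `fig.savefig(...)` afterwards to view the result. Passing more values in `percents`, for example `[25, 50, 75]`, draws one line per percentage.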
