File:Heaps' Law on "War and Peace".svg

From Wikimedia Commons, the free media repository

Original file (SVG file, nominally 662 × 491 pixels, file size: 156 KB)

Captions


Summary

Description
English: Verification of Heaps' law on War and Peace.
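Heaps' law states that the number of distinct words V in a text grows with the total number of words n roughly as V(n) ≈ K·n^β, with 0 < β < 1. The counting step at the heart of the script below can be sketched on a toy token list (a hypothetical example, not text from the novel):

```python
# Heaps' law sketch: track vocabulary size V(n) as tokens are read.
# The token list is a hypothetical toy example, not taken from the novel.
tokens = ["war", "and", "peace", "and", "war", "is", "over", "and"]

vocab = set()
growth = []  # growth[i] = number of distinct words after i + 1 tokens
for token in tokens:
    vocab.add(token)
    growth.append(len(vocab))

print(growth)  # → [1, 2, 3, 3, 3, 4, 5, 5]
```

On a real corpus, fitting a line to (log n, log V) yields estimates of β (the slope) and log K (the intercept), which is what the full script does with `np.polyfit`.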

```python
import random
import urllib.request

import matplotlib.pyplot as plt
import nltk
import numpy as np

# Download the corpus
url = "http://www.gutenberg.org/files/2600/2600-0.txt"
response = urllib.request.urlopen(url)
long_txt = response.read().decode('utf8')

# Tokenize the text, skipping the Project Gutenberg header
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(long_txt.lower())
tokens = tokens[940:]

# Prepare arrays to hold the counts of total words and unique words
total_words = np.arange(1, len(tokens) + 1)
unique_words = np.zeros(len(tokens))

# Count unique words while progressing through the text
word_set = set()
for i, token in enumerate(tokens):
    word_set.add(token)
    unique_words[i] = len(word_set)

# Fit Heaps' law: unique_words = K * total_words ** beta,
# via a linear fit in log-log space
log_total_words = np.log(total_words)
log_unique_words = np.log(unique_words)
beta, logK = np.polyfit(log_total_words, log_unique_words, 1)
K = np.exp(logK)

# Print the estimated parameters
print('K:', K)
print('beta:', beta)

# Plot total words vs. unique words
plt.figure(figsize=(8, 6))
plt.plot(total_words, unique_words, label='Empirical Data')
plt.plot(total_words, K * total_words ** beta, '--',
         label=f"Heaps' Law Fit: K={K:.2f}, beta={beta:.2f}")

# Repeat with the token order shuffled
random.shuffle(tokens)

unique_words = np.zeros(len(tokens))
word_set = set()
for i, token in enumerate(tokens):
    word_set.add(token)
    unique_words[i] = len(word_set)

# Fit Heaps' law to the shuffled data
log_unique_words = np.log(unique_words)
beta, logK = np.polyfit(log_total_words, log_unique_words, 1)
K = np.exp(logK)

print('K:', K)
print('beta:', beta)

plt.plot(total_words, unique_words, label='Shuffled Empirical Data')
plt.plot(total_words, K * total_words ** beta, '--',
         label=f"Heaps' Law Fit for shuffled data: K={K:.2f}, beta={beta:.2f}")

plt.xlabel('Total Words')
plt.ylabel('Unique Words')
plt.legend()
plt.grid(True)
plt.title('Verification of Heaps\' Law on "War and Peace"')
plt.savefig("war and peace.svg", bbox_inches='tight', format='svg')
```
Date
Source Own work
Author Cosmia Nebula

Licensing

I, the copyright holder of this work, hereby publish it under the following license:
This file is licensed under the Creative Commons Attribution-Share Alike 4.0 International license.
You are free:
  • to share – to copy, distribute and transmit the work
  • to remix – to adapt the work
Under the following conditions:
  • attribution – You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
  • share alike – If you remix, transform, or build upon the material, you must distribute your contributions under the same or compatible license as the original.

File history

Click on a date/time to view the file as it appeared at that time.

Date/Time: 22:54, 18 July 2023 (current version)
Dimensions: 662 × 491 (156 KB)
User: Cosmia Nebula (talk | contribs)
Comment: Uploaded while editing "Heaps' law" on en.wikipedia.org

There are no pages that use this file.
