File:Lotka law for the 15 most populated categories on arXiv (2023-07).svg

Original file (SVG file, nominally 1,800 × 1,440 pixels, file size: 351 KB)

Captions

English

Lotka law for the 15 most populated categories on arXiv (2023-07).

Summary

Description	English: Lotka's law for the 15 most populated categories on arXiv, as of 2023-07. It is a log-log plot: the x-axis is the logarithm of an author's number of publications, and the y-axis is the logarithm of the number of authors with at least that many publications. Data from https://www.kaggle.com/datasets/Cornell-University/arxiv

```python
import json
from collections import Counter

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Define empty lists to store data
categories_list = []
authors_parsed_list = []

# Open the file and iterate over each line (one JSON record per line)
with open('arxiv-metadata-oai-snapshot.json', 'r') as file:
    for line in file:
        # Parse the JSON string
        paper = json.loads(line)
        # Extract the "categories" and "authors_parsed" fields
        categories = paper.get("categories", "")
        authors_parsed = paper.get("authors_parsed", [])
        # Split categories string into list and store
        categories_list.append(categories.split())
        # Store the authors_parsed data
        authors_parsed_list.append(authors_parsed)

# Create a DataFrame from the extracted data
df = pd.DataFrame({"categories": categories_list, "authors_parsed": authors_parsed_list})

# Flatten the categories column into a single list
categories_list = [category for categories in df['categories'] for category in categories]

# Count the occurrence of each category
category_counts = Counter(categories_list)

# Sort categories by count in descending order
sorted_categories = sorted(category_counts.items(), key=lambda x: x[1], reverse=True)

# Print the sorted categories and their counts
for category, count in sorted_categories:
    print(f"{category}: {count}")


def count_authors(df, category_list):
    counter = {}
    # Filter rows that match the specified categories
    mask = df['categories'].apply(lambda x: any(category in x for category in category_list))
    filtered_df = df[mask]
    # Flatten the authors_parsed column
    flattened_authors = [author for authors in filtered_df['authors_parsed'] for author in authors]
    # Count the occurrences of each author, keyed as "<first names> <last name>"
    for author in flattened_authors:
        author_name = author[1] + ' ' + author[0]
        counter[author_name] = counter.get(author_name, 0) + 1
    return counter


# Plot
plt.rcParams.update({'font.size': 20})
fig, axs = plt.subplots(figsize=(10, 8))
n_categories = 15
for category, _ in sorted_categories[:n_categories]:
    result = count_authors(df, [category])
    # x is the natural log of each author's publication count (y below uses log10)
    log_pubs = pd.Series(np.log(list(result.values())))
    print(category)
    # Reverse cumulative count: number of authors with at least that many publications
    cumulative_counts = log_pubs.value_counts().sort_index(ascending=False).cumsum().sort_index()
    axs.scatter(cumulative_counts.index, np.log10(cumulative_counts.values), label=category, s=3)
axs.legend()
axs.grid()
axs.set_title(f"Lotka law for the {n_categories} most populated categories on arXiv")
axs.set_xlabel("Log(publications)")
axs.set_ylabel("Log(authors)")
# Save before showing: after plt.show() returns, the figure may already be closed
fig.savefig("out.svg")
plt.show()
```
Date	17 July 2023
Source	Own work
Author	Cosmia Nebula
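
The y-axis quantity described in the summary above (the number of authors with at least a given number of publications) can be illustrated on a tiny made-up data set. This is only a sketch: the helper name `authors_with_at_least` and the toy author counts are hypothetical and are not part of the uploaded script, which computes the same kind of reverse cumulative count with pandas.

```python
from collections import Counter


def authors_with_at_least(pubs_per_author):
    """Map n -> number of authors with at least n publications.

    `pubs_per_author` maps an author name to a publication count
    (hypothetical input, shaped like the output of a per-author counter).
    """
    # n -> number of authors with exactly n publications
    freq = Counter(pubs_per_author.values())
    result = {}
    running = 0
    # Walk from the largest count down, accumulating the number of authors
    for n in sorted(freq, reverse=True):
        running += freq[n]
        result[n] = running
    # Return the mapping in ascending order of n
    return dict(sorted(result.items()))


# Toy data: three authors with 1 paper, two with 2 papers, one with 5 papers
toy = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2, "F": 5}
ccdf = authors_with_at_least(toy)
print(ccdf)  # {1: 6, 2: 3, 5: 1}
```

Plotting the logarithm of the keys against the logarithm of the values gives one point per distinct publication count, which is what each scatter series in the figure shows for one category.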

Licensing

I, the copyright holder of this work, hereby publish it under the following license:

This file is licensed under the Creative Commons Attribution-Share Alike 4.0 International license.

You are free:

to share – to copy, distribute and transmit the work
to remix – to adapt the work

Under the following conditions:

attribution – You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
share alike – If you remix, transform, or build upon the material, you must distribute your contributions under the same or compatible license as the original.

File history

	Date/Time	Thumbnail	Dimensions	User	Comment
current	01:14, 18 July 2023		1,800 × 1,440 (351 KB)	Cosmia Nebula	Uploaded own work with UploadWizard

File usage on Commons

There are no pages that use this file.

File usage on other wikis

The following other wikis use this file:

Usage on en.wikipedia.org
- Lotka's law

Metadata

Width	1440pt
Height	1152pt

Structured data

Items portrayed in this file

depicts	Lotka's law; scientometrics; power law
creator	some value
copyright status	copyrighted
copyright license	Creative Commons Attribution-ShareAlike 4.0 International
source of file	original creation by uploader
inception	17 July 2023
media type	image/svg+xml
