Lexandria Dataset

Any AI, analytics, or data project is only as good as the data it is fed. At Lexandria, we pride ourselves on our ability to vet and qualify the most credible data sources on the web. Our proprietary technology sources and classifies web data across more than 130 categories in more than 180 geographies.

Multi-Lingual Support

Our collection of data spans multiple languages, making it versatile for training models that can operate in a global context.

High-Quality Content

The inclusion of data from authoritative sources ensures that the information is reliable, accurate, and relevant, adding an extra layer of quality to your trained models.

Industry-Specific Classifications

Our proprietary classification nomenclature allows for targeted training and fine-tuning of models, based on the specific industry or vertical you are interested in.

Comprehensive Metadata

The extensive metadata accompanying each piece of content makes it easier to perform data analysis, selection, and manipulation for specialized model training.

High Volume

Given the size and scope of our dataset, you can train models that are capable of a deep understanding across various domains.

Historical Context

Our dataset includes information dating back several years, making it suitable for models that require historical context for accurate predictions and analyses.

Regular Updates

Our dataset is continually updated with the latest content, ensuring that your models stay relevant and timely.

Highlights

  • Over 15 million documents
  • Over 200 thousand video transcripts
  • Over 15 thousand podcast transcripts
  • Over 70 billion words in total

Data description

Our curation process involves the systematic extraction and analysis of data from rigorously vetted sources. These sources include official entities such as government agencies, regulatory bodies, central banks, statistical offices, international development institutions, renowned consulting firms, law firms and think tanks.
We parse the content of documents and transcribe the audio content from videos and podcasts published by these vetted institutions.
Our dataset is well suited to training or fine-tuning large language models for generative and other AI use cases across industries and verticals.

Timed text data

The timed text data for podcasts and videos can provide nuanced training material for applications that require time-stamped or sequential information. These transcripts are also particularly valuable for training models that specialize in understanding the context and flow of conversations.

Categories

Think Tank & Research Institution 1,686,625 docs
Central Bank 1,368,391 docs
Bureau of Statistics 1,085,203 docs
International Development Organization 957,345 docs
Ministry of Health 767,016 docs
University 614,673 docs
Government Portal 558,541 docs
Stock Exchange 510,528 docs
Finance and Insurance Regulatory Authority 453,072 docs
Ministry of Education 433,616 docs

Regions

Global (GLOBAL) 2,090,676 docs
Japan (JP) 1,861,613 docs
United States of America (US) 1,609,102 docs
Spain (ES) 644,886 docs
Switzerland (CH) 636,512 docs
France (FR) 610,999 docs
Europe (EUROPE) 501,413 docs
India (IN) 486,018 docs
United Kingdom (GB) 469,748 docs
Czechia (CZ) 335,327 docs

Languages

English (eng) 7,177,201 docs
Unrecognized (und) 1,962,872 docs
Spanish (spa) 1,835,225 docs
French (fra) 1,397,541 docs
Japanese (jpn) 1,320,436 docs
German (deu) 774,385 docs
Portuguese (por) 648,458 docs
Italian (ita) 477,306 docs
Dutch (nld) 359,097 docs
Russian (rus) 302,308 docs

Data dictionary

id string

Unique identifier for the document

Example: "f1b9d7c0-1fcb-11ec-9621-0242ac130002"

source string

Source name

Example: "Federal Reserve"

category string

Nomenclature category

Example: "Central Bank"

country string

Country or region of origin of the source

Example: "US"

indexDate timestamp

Crawling date and time

Example: 1669049079000 (Mon, 21 Nov 2022 16:44:39 GMT)

documentDate timestamp

Date and time the document was created (not always available)

Example: 1659629416000 (Thu, 04 Aug 2022 16:10:16 GMT)
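
The indexDate and documentDate examples above are consistent with Unix epoch timestamps expressed in milliseconds. The following is a minimal Python sketch of the conversion, assuming that encoding holds for every record. The helper name epoch_ms_to_utc is hypothetical and not part of any Lexandria tooling.

```python
from datetime import datetime, timezone
from typing import Optional


def epoch_ms_to_utc(value: Optional[int]) -> Optional[datetime]:
    """Convert epoch milliseconds to a timezone-aware UTC datetime.

    Returns None when the value is missing, since documentDate
    is not always available.
    """
    if value is None:
        return None
    # Assumption: timestamps are milliseconds since 1970-01-01 UTC.
    return datetime.fromtimestamp(value / 1000, tz=timezone.utc)


index_date = epoch_ms_to_utc(1669049079000)
document_date = epoch_ms_to_utc(1659629416000)

print(index_date.isoformat())     # 2022-11-21T16:44:39+00:00
print(document_date.isoformat())  # 2022-08-04T16:10:16+00:00
```

Both printed values match the human-readable dates given next to the examples, which supports the millisecond interpretation.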

type string

Content type (document, video or podcast)

Example: "Document"

fileType string

File type for documents (e.g. pdf, doc, xlsx, …)

Example: "pdf"

language string

Content language (ISO 639-3)

Example: "eng"

title string

Content title

Example: "Demand-Supply imbalance during the Covid-19 pandemic: The role of fiscal policy"

url string

URL of the piece of content

Example: "https://www.federalreserve.gov/econres/ifdp/files/ifdp1353.pdf"

text string

Text for documents and time-stamped transcript for videos and podcasts

Example: "Board of Governors of the Federal Reserve System International Finance Discussion Papers …"

timedText JSON

JSONArray of objects: { text: string, start: number, end: number, speaker?: number }

Example: [ { "text": "Should I vaccinate my children? ", "start": 0, "end": 2.5, "speaker": 1 }, { "text": "Most parents do. ", "start": 2.5, "end": 4.4, "speaker": 2 }, { "text": "However, with so much conflicting information available, ", "start": 4.4, "end": 7.5, "speaker": 2 }, … ]
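
As an illustration of how the timedText structure might be consumed, the sketch below merges consecutive segments from the same speaker into conversational turns and totals speaking time per speaker. It assumes that start and end are offsets in seconds (the unit is not stated above) and that the field arrives as a JSON string. The function names merge_turns and speaking_time are hypothetical.

```python
import json
from collections import defaultdict

# Segments copied from the timedText example; the elided tail is omitted.
raw = """[
  {"text": "Should I vaccinate my children? ", "start": 0, "end": 2.5, "speaker": 1},
  {"text": "Most parents do. ", "start": 2.5, "end": 4.4, "speaker": 2},
  {"text": "However, with so much conflicting information available, ", "start": 4.4, "end": 7.5, "speaker": 2}
]"""

segments = json.loads(raw)


def merge_turns(segments):
    """Merge consecutive segments by the same speaker into one turn.

    The speaker field is optional; segments without it are grouped
    under None, which treats adjacent unlabeled segments as one turn.
    """
    turns = []
    for seg in segments:
        speaker = seg.get("speaker")
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker continues: extend the current turn.
            turns[-1]["text"] += seg["text"]
            turns[-1]["end"] = seg["end"]
        else:
            turns.append({"speaker": speaker, "start": seg["start"],
                          "end": seg["end"], "text": seg["text"]})
    return turns


def speaking_time(segments):
    """Total duration per speaker, assuming start/end are in seconds."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg.get("speaker")] += seg["end"] - seg["start"]
    return dict(totals)


turns = merge_turns(segments)
totals = speaking_time(segments)

for turn in turns:
    print(f"[{turn['start']:.1f}-{turn['end']:.1f}] "
          f"speaker {turn['speaker']}: {turn['text'].strip()}")
```

Grouping by turn rather than by raw segment is one possible way to prepare the transcripts for conversational training; other applications may prefer to keep the original segment boundaries.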

Explore our sample set

We have prepared a sample set of our data for you to explore. The sample set contains 1,000 podcasts, 2,000 videos and 7,000 documents. Those documents have been sampled randomly within their respective categories.

id | title | type | url | indexDate | documentDate | language | source | category | country | fileType | text | timedText
00000165-c82... | Statistik zur Überschuldung privater Personen - Fachserie 15 Reihe 5 - 2021 - (Letzte Ausgabe, beric | document | https://www.destatis.de/DE/Themen/Gesellschaft-Umwelt/Einkommen-Konsum-Lebensbedingungen/Vermoegen-Schulden/Publikationen/Downloads-Vermoegen-Schulden/ueberschuldung-2150500217005.xlsx?__blob=publicationFile | 1/15/2023 | 5/25/2022 | German | Statistisches Bundesamt | Bureau of Statistics | Germany | xlsx | | N/A
0000016c-aae... | Download Report | document | https://euromed-economists.org/download/emea-webinar-report-digital-financial-inclusion-a-pillar-for-resilience-post-covid-19-16-july-2020/?wpdmdl=9758&refresh=641913ca9cba61679365066 | 3/21/2023 | 7/29/2020 | English | Euro-Mediterranean Economists Association | Think Tank & Research Institution | Global | pdf | | N/A
000002f6-791... | Association of University Centers on Disabilities | document | https://www.dol.gov/sites/dolgov/files/EBSA/laws-and-regulations/rules-and-regulations/public-comments/1210-AB45/00054.pdf | 1/22/2023 | 9/21/2010 | English | U.S. Department of Labor | Ministry of Labour | United States of America | pdf | | N/A
0000032a-c4f... | Juristat, vol. 21, no. 4 - ARCHIVÉ | document | https://www150.statcan.gc.ca/n1/pub/85-002-x/85-002-x2001004-fra.pdf | 2/16/2023 | 6/5/2001 | French | Statistics Canada | Bureau of Statistics | Canada | pdf | | N/A
0000039e-eb4... | | document | https://www.bot.or.th/English/Statistics/FinancialInstitutions/BLS/BLSReport/E256504070.xls | 1/5/2023 | 5/25/2022 | English | Bank of Thailand | Central Bank | Thailand | xls | | N/A
000003b9-e00... | Rate Lock Agreement | document | https://www.jpmorgan.com/content/dam/jpm/global/disclosures/IN/pds-ratelockagreement.pdf | 1/4/2023 | 1/2/2023 | English | JPMorgan Chase | Investment & Financial Services | United States of America | pdf | | N/A
000003e2-c06... | PERIODO 2014 | document | https://www.conatel.gov.py/conatel/wp-content/uploads/2020/01/midt-2014.pdf | 2/4/2023 | 6/9/2015 | Spanish | Comisión Nacional de Telecomunicaciones | Telecommunications Regulatory Authority | Paraguay | pdf | | N/A
00000418-f32... | | document | https://www.soumu.go.jp/main_content/000413280.pdf | 12/24/2022 | 4/7/2016 | Japanese | Ministry of Internal Affairs and Communications | Ministry of Interior | Japan | pdf | | N/A
00000499-9ae... | dokument | document | https://www.belex.rs/data/2017/07/00105138.pdf | 2/5/2023 | 7/3/2017 | Unrecognized | Belgrade Stock Exchange | Stock Exchange | Serbia | pdf | | N/A
000004f1-52d... | 462wp.pdf | document | https://kingcenter.stanford.edu/sites/g/files/sbiybj16611/files/media/file/462wp_0.pdf | 2/7/2023 | 1/7/2013 | English | Stanford University | University | United States of America | pdf | | N/A
000005a2-fc9... | Veja aqui o Sumário Executivo da Medida Provisória | document | https://www12.senado.leg.br/publicacoes/estudos-legislativos/tipos-de-estudos/sumarios-de-proposicoes/mpv619/at_download/file | 12/24/2022 | 6/17/2013 | Portuguese | Senado Federal | Parliamentary Office | Brazil | pdf | | N/A
000006e8-38a... | 2012 | document | https://www.bag.admin.ch/bag/fr/home/zahlen-und-statistiken/zahlen-fakten-zu-pflegeheimen/pflegeheim-suchen/_jcr_content/par/externalcontent.bitexternalcontent.exturl.pdf/aHR0cHM6Ly9zb21lZC5iYWdhcHBzLmNoL3BkZl9zZXJ2ZS9CQU/dfMTZmbnFtX2t6aF8yMDEyX2ZyLnBkZj92PTE2ODYwMTQ4ODM=/.pdf | 6/7/2023 | 2/4/2019 | French | Bundesamt für Gesundheit BAG | Ministry of Health | Switzerland | pdf | | N/A
00000732-de0... | Download Report | document | https://euromed-economists.org/download/emea-webinar-report-reforming-international-debt-architecture-can-debt-transparency-be-achieved-for-africa-22-october-2020/?wpdmdl=10507&refresh=6418f4d2912221679357138 | 3/21/2023 | 11/19/2020 | English | Euro-Mediterranean Economists Association | Think Tank & Research Institution | Global | pdf | | N/A
0000074e-4f7... | Tim Bergsma | document | https://www.dol.gov/sites/dolgov/files/EBSA/laws-and-regulations/rules-and-regulations/public-comments/1210-AB44-2/09544.pdf | 1/22/2023 | 10/12/2011 | English | U.S. Department of Labor | Ministry of Labour | United States of America | pdf | | N/A
0000075b-9a8... | 餐厅、酒吧以及露天茶座的防疫指南 11.02.2021 | document | https://www.sanidad.gob.es/areas/alertasEmergenciasSanitarias/alertasActuales/nCov/ciudadania/docs/20_09_15_COVID19_Consejos_bares_terrazas_ZH_AGL_JC_AB.pdf | 8/11/2023 | 3/1/2021 | Mandarin Chinese | Ministerio de Sanidad | Ministry of Health | Spain | pdf | | N/A
000007ce-ba2... | 金融高度化セミナー「金融データ交換ネットワークの高度化」の資料「日本銀行におけるXBRLへの取り組みについて」 [PDF 2,404KB] | document | https://www.boj.or.jp/announcements/release_2005/fsc0512c.pdf | 12/1/2022 | 12/2/2016 | Japanese | Bank of Japan | Central Bank | Japan | pdf | | N/A
000009af-7d7... | Download Download PDF | document | https://eumj.med.sumdu.edu.ua/index.php/journal/article/download/25/20 | 7/25/2023 | 6/26/2019 | English | Sumy State University | University | Ukraine | pdf | | N/A
00000a1f-3c6... | Secretaria Executiva | document | https://www.gov.br/mj/pt-br/acesso-a-informacao/institucional/organogramas/ORGANOGRAMAS%20UNIDADES/organogramas-2023/organograma-se.pdf | 6/8/2023 | 2/2/2023 | Portuguese | Portal Único do Governo | Government Portal | Brazil | pdf | | N/A
00000a8b-ca6... | Debt of All Levels of Government | document | https://service.mof.gov.tw/public/Data/statistic/monthly_e/11010/m2130_11010.xls | 2/15/2023 | 11/30/2021 | English | Ministry of Finance | Ministry of Finance | Taiwan | xls | | N/A
00000ab1-f5e... | Estadística Condenados Menores - Personas condenadas - Año 2013Abre en nueva ventana | document | https://www.poderjudicial.es/stfls/ESTADISTICA%20JUDICIAL%20NUEVO/FICHEROS/7002%20Menores/A%C3%B1os%20Anteriores/A%C3%B1o%202013/Estadistica%20Condenados%20Menores%20Personas%20condenadas%202013.xls?t=202303203610 | 3/21/2023 | 12/9/2016 | Spanish | Tribunal Supremo | Judicial Authority | Spain | xls | | N/A

Use cases

  • Artificial Intelligence (AI)
  • Machine Learning (ML)
  • Large Language Models (LLM)
  • Natural Language Processing (NLP)
  • International Development
  • Strategy Consulting
  • Government
  • Finance
  • Pre-training
  • Fine-tuning
  • Conversational models

Get in touch for a custom quote

Data is available as cross-dimensional subsets. Custom subsets can be curated using filters such as content type, category, and country (e.g., transcripts of podcasts published by law firms in the US; reports published by competition authorities in Europe; time-stamped video transcripts of think tanks in French).
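
To make the idea of a cross-dimensional subset concrete, here is a minimal Python sketch of filtering records by the data dictionary fields type, category, country and language. It assumes each record is available as a dictionary keyed by those field names; the delivery format is not specified here, so that is an assumption. The function select_subset, the sample records and the "Law Firm" category label are hypothetical and for illustration only.

```python
def select_subset(records, *, content_type=None, category=None,
                  country=None, language=None):
    """Yield records matching every filter that is set (None = no filter).

    Comparison is case-insensitive, since casing of values such as
    the content type may vary between records.
    """
    wanted = {"type": content_type, "category": category,
              "country": country, "language": language}
    active = {key: value for key, value in wanted.items() if value is not None}
    for rec in records:
        if all(str(rec.get(key, "")).lower() == value.lower()
               for key, value in active.items()):
            yield rec


# Made-up records for illustration; these values are not from the dataset.
records = [
    {"id": "a", "type": "podcast", "category": "Law Firm",
     "country": "US", "language": "eng"},
    {"id": "b", "type": "video", "category": "Think Tank & Research Institution",
     "country": "FR", "language": "fra"},
    {"id": "c", "type": "Document", "category": "Central Bank",
     "country": "US", "language": "eng"},
]

us_law_podcasts = list(select_subset(
    records, content_type="podcast", category="Law Firm", country="US"))
french_think_tank_videos = list(select_subset(
    records, content_type="video",
    category="Think Tank & Research Institution", language="fra"))

print([rec["id"] for rec in us_law_podcasts])           # ['a']
print([rec["id"] for rec in french_think_tank_videos])  # ['b']
```

Each keyword argument corresponds to one filter dimension, so combining them narrows the subset in the same way as the examples in the paragraph above.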

Contact us to request the data sample or a quote.