Lexandria Dataset
Any AI, analytics, or data project is only as good as the data it is fed. At Lexandria, we pride ourselves on our ability to vet and qualify the most credible web data sources. Our proprietary technology sources and classifies web data across more than 130 categories and more than 180 geographies.
Multi-Lingual Support
Our collection of data spans multiple languages, making it versatile for training models that can operate in a global context.
High-Quality Content
The inclusion of data from authoritative sources ensures that the information is reliable, accurate, and relevant, adding an extra layer of quality to your trained models.
Industry-Specific Classifications
Our proprietary classification nomenclature allows for targeted training and fine-tuning of models, based on the specific industry or vertical you are interested in.
Comprehensive Metadata
The extensive metadata accompanying each piece of content makes it easier to perform data analysis, selection, and manipulation for specialized model training.
High Volume
Given the size and scope of our dataset, you can train models that are capable of a deep understanding across various domains.
Historical Context
Our dataset includes information dating back several years, making it suitable for models that require historical context for accurate predictions and analyses.
Regular Updates
Our dataset is continually updated with the latest content, ensuring that your models stay relevant and timely.
Highlights
- Over 15 million documents
- Over 200 thousand video transcripts
- Over 15 thousand podcast transcripts
- In total over 70 billion words in the dataset
Data description
Our curation process involves the systematic extraction and analysis of data from rigorously
vetted sources. These sources include official entities such as government agencies, regulatory
bodies, central banks, statistical offices, international development institutions, renowned
consulting firms, law firms and think tanks.
We parse the content of documents and transcribe the audio content from videos and podcasts published
by these vetted institutions.
Our dataset is well suited to training or fine-tuning large language models for generative and other AI use cases
across industries and verticals.
Timed text data
The timed text data for podcasts and videos can provide nuanced training material for applications that require time-stamped or sequential information. These transcripts are also particularly valuable for training models that specialize in understanding the context and flow of conversations.
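As a rough illustration of how timed text can expose conversational flow, the sketch below merges consecutive segments spoken by the same speaker into turns. It assumes segments follow the `timedText` object shape listed in the data dictionary (`text`, `start`, `end`, optional `speaker`) and are already ordered by start time. The function name `speaker_turns` and the sample segments are hypothetical, and the unit of `start` and `end` is not stated in this document.

```python
from itertools import groupby


def speaker_turns(segments):
    """Merge consecutive segments from the same speaker into turns.

    Assumes `segments` is ordered by start time. Segments without a
    `speaker` key are grouped under speaker None.
    """
    turns = []
    for speaker, group in groupby(segments, key=lambda s: s.get("speaker")):
        group = list(group)
        turns.append({
            "speaker": speaker,
            "start": group[0]["start"],   # start of the first segment in the turn
            "end": group[-1]["end"],      # end of the last segment in the turn
            "text": " ".join(s["text"] for s in group),
        })
    return turns


# Hypothetical segments, not taken from the dataset.
segments = [
    {"text": "Good morning.", "start": 0.0, "end": 1.5, "speaker": 0},
    {"text": "Thanks for joining.", "start": 1.5, "end": 3.0, "speaker": 0},
    {"text": "Happy to be here.", "start": 3.0, "end": 4.5, "speaker": 1},
]
turns = speaker_turns(segments)
```

Turn-level records like these can be a convenient starting point for building dialogue-style training examples, though the right granularity depends on the application.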
Categories
Regions
Languages
Data dictionary
Field | Type | Description |
---|---|---|
id | string | Unique identifier for the document |
source | string | Source name |
category | string | Nomenclature category |
country | string | Country/region origin of the source |
indexDate | timestamp | Crawling date and time |
documentDate | timestamp | Date and time the document was created (not always available) |
type | string | Content type (document, video or podcast) |
fileType | string | File type for documents (e.g. pdf, doc, xlsx, …) |
language | string | Content language (ISO 639-3) |
title | string | Content title |
url | string | URL of the piece of content |
text | string | Text for documents and time-stamped transcript for videos and podcasts |
timedText | JSON | JSONArray of objects: { text: string, start: number, end: number, speaker?: number } |
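The `timedText` field can be read with a few lines of standard-library Python. The following is a minimal sketch, assuming the field arrives as a JSON string holding an array of objects in the shape given above. The `Segment` class, the `parse_timed_text` helper, and the sample value are hypothetical names and data introduced for illustration, not part of the dataset or any Lexandria API.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    text: str
    start: float
    end: float
    speaker: Optional[int] = None  # optional, per the data dictionary


def parse_timed_text(raw: str) -> list[Segment]:
    """Parse a timedText JSON array into a list of Segment objects."""
    segments = []
    for item in json.loads(raw):
        seg = Segment(
            text=item["text"],
            start=float(item["start"]),
            end=float(item["end"]),
            speaker=item.get("speaker"),  # None when the key is absent
        )
        # Basic sanity check: a segment should not end before it starts.
        if seg.end < seg.start:
            raise ValueError(f"segment ends before it starts: {item!r}")
        segments.append(seg)
    return segments


# Hypothetical sample value, not taken from the dataset.
raw = (
    '[{"text": "Welcome.", "start": 0.0, "end": 1.2, "speaker": 0},'
    ' {"text": "Thanks.", "start": 1.2, "end": 2.0}]'
)
segments = parse_timed_text(raw)
```

Treating `speaker` as optional mirrors the `speaker?` marker in the field description; downstream code should handle segments where it is `None`.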
Explore our sample set
We have prepared a sample set of our data for you to explore. The sample set contains 1,000 podcasts, 2,000 videos and 7,000 documents. Those documents have been sampled randomly within their respective categories.
id | title | type | url | indexDate | documentDate | language | source | category | country | fileType | text | timedText |
---|---|---|---|---|---|---|---|---|---|---|---|---|
00000165-c82... | Statistik zur Überschuldung privater Personen - Fachserie 15 Reihe 5 - 2021 - (Letzte Ausgabe, beric | document | https://www.destatis.de/DE/Themen/Gesellschaft-Umwelt/Einkommen-Konsum-Lebensbedingungen/Vermoegen-Schulden/Publikationen/Downloads-Vermoegen-Schulden/ueberschuldung-2150500217005.xlsx?__blob=publicationFile | 1/15/2023 | 5/25/2022 | German | Statistisches Bundesamt | Bureau of Statistics | Germany | xlsx | N/A |  |
0000016c-aae... | Download Report | document | https://euromed-economists.org/download/emea-webinar-report-digital-financial-inclusion-a-pillar-for-resilience-post-covid-19-16-july-2020/?wpdmdl=9758&refresh=641913ca9cba61679365066 | 3/21/2023 | 7/29/2020 | English | Euro-Mediterranean Economists Association | Think Tank & Research Institution | Global | N/A |  |  |
000002f6-791... | Association of University Centers on Disabilities | document | https://www.dol.gov/sites/dolgov/files/EBSA/laws-and-regulations/rules-and-regulations/public-comments/1210-AB45/00054.pdf | 1/22/2023 | 9/21/2010 | English | U.S. Department of Labor | Ministry of Labour | United States of America | N/A |  |  |
0000032a-c4f... | Juristat, vol. 21, no. 4 - ARCHIVÉ | document | https://www150.statcan.gc.ca/n1/pub/85-002-x/85-002-x2001004-fra.pdf | 2/16/2023 | 6/5/2001 | French | Statistics Canada | Bureau of Statistics | Canada | N/A |  |  |
0000039e-eb4... |  | document | https://www.bot.or.th/English/Statistics/FinancialInstitutions/BLS/BLSReport/E256504070.xls | 1/5/2023 | 5/25/2022 | English | Bank of Thailand | Central Bank | Thailand | xls | N/A |  |
000003b9-e00... | Rate Lock Agreement | document | https://www.jpmorgan.com/content/dam/jpm/global/disclosures/IN/pds-ratelockagreement.pdf | 1/4/2023 | 1/2/2023 | English | JPMorgan Chase | Investment & Financial Services | United States of America | N/A |  |  |
000003e2-c06... | PERIODO 2014 | document | https://www.conatel.gov.py/conatel/wp-content/uploads/2020/01/midt-2014.pdf | 2/4/2023 | 6/9/2015 | Spanish | Comisión Nacional de Telecomunicaciones | Telecommunications Regulatory Authority | Paraguay | N/A |  |  |
00000418-f32... | �S�̔� | document | https://www.soumu.go.jp/main_content/000413280.pdf | 12/24/2022 | 4/7/2016 | Japanese | Ministry of Internal Affairs and Communications | Ministry of Interior | Japan | N/A |  |  |
00000499-9ae... | dokument | document | https://www.belex.rs/data/2017/07/00105138.pdf | 2/5/2023 | 7/3/2017 | Unrecognized | Belgrade Stock Exchange | Stock Exchange | Serbia | N/A |  |  |
000004f1-52d... | 462wp.pdf | document | https://kingcenter.stanford.edu/sites/g/files/sbiybj16611/files/media/file/462wp_0.pdf | 2/7/2023 | 1/7/2013 | English | Stanford University | University | United States of America | N/A |  |  |
000005a2-fc9... | Veja aqui o Sumário Executivo da Medida Provisória | document | https://www12.senado.leg.br/publicacoes/estudos-legislativos/tipos-de-estudos/sumarios-de-proposicoes/mpv619/at_download/file | 12/24/2022 | 6/17/2013 | Portuguese | Senado Federal | Parliamentary Office | Brazil | N/A |  |  |
000006e8-38a... | 2012 | document | https://www.bag.admin.ch/bag/fr/home/zahlen-und-statistiken/zahlen-fakten-zu-pflegeheimen/pflegeheim-suchen/_jcr_content/par/externalcontent.bitexternalcontent.exturl.pdf/aHR0cHM6Ly9zb21lZC5iYWdhcHBzLmNoL3BkZl9zZXJ2ZS9CQU/dfMTZmbnFtX2t6aF8yMDEyX2ZyLnBkZj92PTE2ODYwMTQ4ODM=/.pdf | 6/7/2023 | 2/4/2019 | French | Bundesamt für Gesundheit BAG | Ministry of Health | Switzerland | N/A |  |  |
00000732-de0... | Download Report | document | https://euromed-economists.org/download/emea-webinar-report-reforming-international-debt-architecture-can-debt-transparency-be-achieved-for-africa-22-october-2020/?wpdmdl=10507&refresh=6418f4d2912221679357138 | 3/21/2023 | 11/19/2020 | English | Euro-Mediterranean Economists Association | Think Tank & Research Institution | Global | N/A |  |  |
0000074e-4f7... | Tim Bergsma | document | https://www.dol.gov/sites/dolgov/files/EBSA/laws-and-regulations/rules-and-regulations/public-comments/1210-AB44-2/09544.pdf | 1/22/2023 | 10/12/2011 | English | U.S. Department of Labor | Ministry of Labour | United States of America | N/A |  |  |
0000075b-9a8... | 餐厅、酒吧以及露天茶座的防疫指南 11.02.2021 | document | https://www.sanidad.gob.es/areas/alertasEmergenciasSanitarias/alertasActuales/nCov/ciudadania/docs/20_09_15_COVID19_Consejos_bares_terrazas_ZH_AGL_JC_AB.pdf | 8/11/2023 | 3/1/2021 | Mandarin Chinese | Ministerio de Sanidad | Ministry of Health | Spain | N/A |  |  |
000007ce-ba2... | 金融高度化セミナー「金融データ交換ネットワークの高度化」の資料「日本銀行におけるXBRLへの取り組みについて」 [PDF 2,404KB] | document | https://www.boj.or.jp/announcements/release_2005/fsc0512c.pdf | 12/1/2022 | 12/2/2016 | Japanese | Bank of Japan | Central Bank | Japan | N/A |  |  |
000009af-7d7... | Download Download PDF | document | https://eumj.med.sumdu.edu.ua/index.php/journal/article/download/25/20 | 7/25/2023 | 6/26/2019 | English | Sumy State University | University | Ukraine | N/A |  |  |
00000a1f-3c6... | Secretaria Executiva | document | https://www.gov.br/mj/pt-br/acesso-a-informacao/institucional/organogramas/ORGANOGRAMAS%20UNIDADES/organogramas-2023/organograma-se.pdf | 6/8/2023 | 2/2/2023 | Portuguese | Portal Único do Governo | Government Portal | Brazil | N/A |  |  |
00000a8b-ca6... | Debt of All Levels of Government | document | https://service.mof.gov.tw/public/Data/statistic/monthly_e/11010/m2130_11010.xls | 2/15/2023 | 11/30/2021 | English | Ministry of Finance | Ministry of Finance | Taiwan | xls | N/A |  |
00000ab1-f5e... | Estadística Condenados Menores - Personas condenadas - Año 2013Abre en nueva ventana | document | https://www.poderjudicial.es/stfls/ESTADISTICA%20JUDICIAL%20NUEVO/FICHEROS/7002%20Menores/A%C3%B1os%20Anteriores/A%C3%B1o%202013/Estadistica%20Condenados%20Menores%20Personas%20condenadas%202013.xls?t=202303203610 | 3/21/2023 | 12/9/2016 | Spanish | Tribunal Supremo | Judicial Authority | Spain | xls | N/A |  |
Use cases
- Artificial Intelligence (AI)
- Machine Learning (ML)
- Large Language Models (LLM)
- Natural Language Processing (NLP)
- International Development
- Strategy Consulting
- Government
- Finance
- Pre-training
- Fine-tuning
- Conversational models
Get in touch for a custom quote
Data is available as cross-dimensional subsets. Custom subsets can be curated using filters such
as content type, category, and country. Examples include transcripts of podcasts published by law firms in the US,
reports published by competition authorities in Europe, and time-stamped video transcripts from think tanks
in French.
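To make the idea of a cross-dimensional subset concrete, here is a minimal sketch of filtering records on several fields at once. It assumes records are plain dictionaries keyed by the field names in the data dictionary. The `select_subset` helper, the sample records, and the "Law Firm" category label are hypothetical and only illustrate the filtering logic, not how Lexandria actually prepares subsets.

```python
def select_subset(records, **filters):
    """Keep records whose fields equal every given filter value."""
    return [
        r for r in records
        if all(r.get(field) == value for field, value in filters.items())
    ]


# Hypothetical records using field names from the data dictionary.
records = [
    {"id": "a", "type": "podcast", "category": "Law Firm",
     "country": "United States of America"},
    {"id": "b", "type": "document", "category": "Central Bank",
     "country": "Thailand"},
    {"id": "c", "type": "podcast", "category": "Central Bank",
     "country": "Japan"},
]

# Podcasts published by law firms (combines two dimensions).
subset = select_subset(records, type="podcast", category="Law Firm")
```

Each keyword argument adds one dimension to the filter, so combining content type, category, country, or language narrows the selection step by step.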
Contact us to request the data sample or a quote: