Pedro Ortiz Suarez
Other names: Pedro Javier Ortiz Suárez
Senior Research Scientist, Common Crawl Foundation
Verified email at commoncrawl.org
Title
Cited by
Year
Bloom: A 176b-parameter open-access multilingual language model
T Le Scao, A Fan, C Akiki, E Pavlick, S Ilić, D Hesslow, R Castagné, ...
Cited by: 1164; Year: 2023
CamemBERT: a Tasty French Language Model
L Martin, B Muller, PJ Ortiz Suárez, Y Dupont, L Romary, ...
Proceedings of the 58th Annual Meeting of the Association for Computational …, 2020
Cited by: 1083; Year: 2020
Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures
PJ Ortiz Suárez, B Sagot, L Romary
7th Workshop on the Challenges in the Management of Large Corpora, 2019
Cited by: 399*; Year: 2019
A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages
PJ Ortiz Suárez, L Romary, B Sagot
Proceedings of the 58th Annual Meeting of the Association for Computational …, 2020
Cited by: 194*; Year: 2020
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
J Kreutzer, I Caswell, L Wang, A Wahab, D van Esch, N Ulzii-Orshikh, ...
Transactions of the Association for Computational Linguistics 10, 50-72, 2022
Cited by: 178*; Year: 2022
The bigscience roots corpus: A 1.6 tb composite multilingual dataset
H Laurençon, L Saulnier, T Wang, C Akiki, A Villanova del Moral, ...
Advances in Neural Information Processing Systems 35, 31809-31826, 2022
Cited by: 102; Year: 2022
Towards a Cleaner Document-Oriented Multilingual Crawled Corpus
J Abadji, P Ortiz Suarez, L Romary, B Sagot
arXiv preprint arXiv:2201.06642, 2022
Cited by: 94; Year: 2022
Ungoliant: An optimized pipeline for the generation of a very large-scale multilingual web corpus
J Abadji, PJO Suárez, L Romary, B Sagot
CMLC 2021-9th Workshop on Challenges in the Management of Large Corpora, 2021
Cited by: 45; Year: 2021
Building a user-generated content north-african arabizi treebank: Tackling hell
D Seddah, F Essaidi, A Fethi, M Futeral, B Muller, PJ Ortiz Suárez, ...
Proceedings of the 58th Annual Meeting of the Association for Computational …, 2020
Cited by: 43; Year: 2020
Quality at a glance: An audit of web-crawled multilingual datasets
I Caswell, J Kreutzer, L Wang, A Wahab, D van Esch, N Ulzii-Orshikh, ...
arXiv e-prints, arXiv: 2103.12028, 2021
Cited by: 32; Year: 2021
Establishing a New State-of-the-Art for French Named Entity Recognition
PJ Ortiz Suárez, Y Dupont, B Muller, L Romary, B Sagot
Proceedings of The 12th Language Resources and Evaluation Conference, 4631–4638, 2020
Cited by: 21*; Year: 2020
From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early Modern French
S Gabay, P Ortiz Suarez, A Bartz, A Chagué, R Bawden, P Gambette, ...
arXiv preprint arXiv:2202.09452, 2022
Cited by: 12; Year: 2022
Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources
A McMillan-Major, Z Alyafeai, S Biderman, K Chen, F De Toni, G Dupont, ...
arXiv preprint arXiv:2201.10066, 2022
Cited by: 12; Year: 2022
Les modèles de langue contextuels Camembert pour le français: impact de la taille et de l'hétérogénéité des données d'entrainement
L Martin, B Muller, PJ Ortiz Suárez, Y Dupont, L Romary, E Clergerie, ...
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP …, 2020
Cited by: 11; Year: 2020
Perplexed by quality: A perplexity-based method for adult and harmful content detection in multilingual heterogeneous web data
T Jansen, Y Tong, V Zevallos, PO Suarez
arXiv preprint arXiv:2212.10440, 2022
Cited by: 9; Year: 2022
Automatic extraction of materials and properties from superconductors scientific literature
L Foppiano, PB Castro, P Ortiz Suarez, K Terashima, Y Takano, M Ishii
Science and Technology of Advanced Materials: Methods 3 (1), 2153633, 2023
Cited by: 8; Year: 2023
Bertrade: Using contextual embeddings to parse old french
L Grobol, M Regnault, PO Suarez, B Sagot, L Romary, B Crabbé
13th Language Resources and Evaluation Conference, 2022
Cited by: 6; Year: 2022
SinNer@CLEF-HIPE2020: Sinful Adaptation of SotA models for Named Entity Recognition in Historical French and German Newspapers
PJ Ortiz Suárez, Y Dupont, G Lejeune, T Tian
CLEF 2020 Working Notes 2696, 2020
Cited by: 6*; Year: 2020
Tokenizer Choice For LLM Training: Negligible or Crucial?
M Ali, M Fromm, K Thellmann, R Rutmann, M Lübbering, J Leveling, ...
arXiv preprint arXiv:2310.08754, 2023
Cited by: 4; Year: 2023
French Contextualized Word-Embeddings with a sip of CaBeRnet: a New French Balanced Reference Corpus
M Popa-Fabre, PJ Ortiz Suárez, B Sagot, ÉV de la Clergerie
Proceedings of the 8th Workshop on Challenges in the Management of Large …, 2020
Cited by: 3; Year: 2020