Jellyfish: A Large Language Model for Data Preprocessing | Semantic Scholar (2024)

Figures and Tables from this paper: figure 1; tables 1–4, 6, 7, 9–12, and 15.

4 Citations

Entity Matching using Large Language Models
    R. Peeters, Christian Bizer

    Computer Science

    ArXiv

  • 2023

It is shown that for use cases that do not allow data to be shared with third parties, open-source LLMs can be a viable alternative to hosted LLMs given that a small amount of training data or matching knowledge is required.
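
As a rough illustration of the prompt-based entity matching studied in this citing paper, the sketch below builds a yes/no matching prompt for a pair of records and parses the model's answer. The record fields, answer format, and parsing rule are illustrative assumptions, not the paper's code; the LLM call itself is left out.

```python
# Minimal sketch of prompt-based entity matching with an (open-source) LLM.
# Record fields and the yes/no parsing rule are illustrative assumptions.

def build_match_prompt(record_a: dict, record_b: dict) -> str:
    serialize = lambda r: "; ".join(f"{k}: {v}" for k, v in r.items())
    return (
        "Do the following two product records refer to the same real-world entity?\n"
        f"Record A: {serialize(record_a)}\n"
        f"Record B: {serialize(record_b)}\n"
        "Answer with Yes or No."
    )

def parse_match_answer(answer: str) -> bool:
    return answer.strip().lower().startswith("yes")

prompt = build_match_prompt(
    {"title": "Apple iPhone 13 128GB Blue", "price": "699"},
    {"title": "iPhone 13 (128 GB) - blue", "price": "698.99"},
)
print(prompt)
print(parse_match_answer("Yes, both records describe the same phone."))  # True
```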

Large Language Models (LLMs) on Tabular Data: Prediction, Generation, and Understanding - A Survey
    Xi Fang, Weijie Xu, Christos Faloutsos

    Computer Science

    ArXiv

  • 2024

This survey aims to address the gap by consolidating recent progress in these areas, offering a thorough survey and taxonomy of the datasets, metrics, and methodologies utilized, while providing some insights for future research directions in this vital and rapidly evolving field.

Quantitative knowledge retrieval from large language models
    David Selby, Kai Spriestersbach, Sebastian Vollmer

    Computer Science, Linguistics

    ArXiv

  • 2024

The feasibility of LLMs as a mechanism for quantitative knowledge retrieval to aid data analysis tasks such as elicitation of prior distributions for Bayesian models and imputation of missing data is explored.
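
In the spirit of the quantitative knowledge retrieval idea above, a minimal imputation sketch might ask the model for a single numeric value for a missing cell and parse it; the column names, prompt wording, and regex-based parsing below are assumptions for illustration only.

```python
# Sketch of LLM-based imputation of a missing numeric cell. Column names and the
# parsing rule are illustrative assumptions; the LLM call itself is omitted.
import re

def build_imputation_prompt(row: dict, target_column: str) -> str:
    known = ", ".join(f"{k} = {v}" for k, v in row.items()
                      if v is not None and k != target_column)
    return (
        f"A record has the known attributes: {known}. "
        f"Give a single plausible numeric value for the missing attribute '{target_column}'. "
        "Reply with the number only."
    )

def parse_numeric_answer(answer: str) -> float | None:
    match = re.search(r"-?\d+(\.\d+)?", answer)
    return float(match.group()) if match else None

row = {"country": "Japan", "year": 2020, "life_expectancy": None}
print(build_imputation_prompt(row, "life_expectancy"))
print(parse_numeric_answer("Roughly 84.6 years"))  # 84.6
```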

Unicorn: A Unified Multi-Tasking Matching Model
    Ju Fan, Jianhong Tu, Nan Tang

    Computer Science

The proposed Unicorn, a unified model for generally supporting common data matching tasks, adopts a mixture-of-experts architecture that enhances the learned representations, and it achieves better performance on most tasks, and on average, than state-of-the-art task-specific models trained separately for individual tasks and datasets.
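
A toy mixture-of-experts layer of the kind Unicorn builds on is sketched below: a gate softly combines several expert transformations of a matching representation. The dimensions and the dense (soft) routing are assumptions, not the paper's exact architecture.

```python
# Toy mixture-of-experts layer in PyTorch: a gating network produces per-expert
# weights and the output is the weighted sum of all expert outputs.
import torch
from torch import nn

class MoELayer(nn.Module):
    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:            # x: (batch, dim)
        weights = torch.softmax(self.gate(x), dim=-1)               # (batch, num_experts)
        expert_out = torch.stack([e(x) for e in self.experts], 1)   # (batch, num_experts, dim)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)      # (batch, dim)

layer = MoELayer(dim=32)
print(layer(torch.randn(8, 32)).shape)  # torch.Size([8, 32])
```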

115 References

Large Language Models as Data Preprocessors
    Haochen Zhang, Yuyang Dong, Chuan Xiao, M. Oyamada

    Computer Science, Linguistics

    ArXiv

  • 2023

An LLM-based framework for data preprocessing is proposed, which integrates cutting-edge prompt engineering techniques, coupled with traditional methods like contextualization and feature selection, to improve the performance and efficiency of these models.
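
A rough sketch of such an instruction-style preprocessing prompt (here, error detection) with a crude form of feature selection is shown below: only a chosen subset of attributes is serialized into the context. The attribute names and instruction wording are assumptions, not the framework's actual prompts.

```python
# Sketch of an instruction-style data-preprocessing prompt (error detection),
# serializing only selected attributes into the context. Illustrative only.

def serialize_row(row: dict, selected_attributes: list[str]) -> str:
    return ", ".join(f'{k}: "{row[k]}"' for k in selected_attributes if k in row)

def build_error_detection_prompt(row: dict, selected_attributes: list[str], target: str) -> str:
    context = serialize_row(row, selected_attributes)
    return (
        f"Given the record [{context}], is the value of attribute "
        f"'{target}' erroneous? Answer with Yes or No."
    )

row = {"city": "Chicgo", "state": "IL", "zip": "60614", "notes": "long free text ..."}
print(build_error_detection_prompt(row, ["city", "state", "zip"], "city"))
```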

Platypus: Quick, Cheap, and Powerful Refinement of LLMs
    Ariel N. Lee, Cole J. Hunter, Nataniel Ruiz

    Computer Science

    ArXiv

  • 2023

The Platypus family achieves strong performance in quantitative LLM metrics across model sizes, topping the global Open LLM leaderboard while using just a fraction of the fine-tuning data and overall compute that are required for other state-of-the-art fine-tuned LLMs.

Prefix-Tuning: Optimizing Continuous Prompts for Generation
    Xiang Lisa Li, Percy Liang

    Computer Science

    ACL

  • 2021

Prefix-tuning is proposed, a lightweight alternative to fine-tuning for natural language generation tasks, which keeps language model parameters frozen and instead optimizes a sequence of continuous task-specific vectors, called the prefix.
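
A simplified sketch of the idea follows: the language model stays frozen and only a short sequence of continuous prefix vectors is trained. For brevity the prefix is prepended at the embedding layer (which is closer to prompt tuning); the paper's method injects prefix activations at every Transformer layer. The choice of GPT-2 and the prefix length are assumptions.

```python
# Simplified prefix-tuning sketch: frozen LM, trainable continuous prefix vectors.
import torch
from torch import nn
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
for p in model.parameters():
    p.requires_grad = False                        # keep all LM parameters frozen

prefix_len = 10
prefix = nn.Parameter(0.02 * torch.randn(prefix_len, model.config.n_embd))

def loss_with_prefix(input_ids: torch.Tensor) -> torch.Tensor:
    batch = input_ids.size(0)
    token_embeds = model.transformer.wte(input_ids)                  # (B, T, d)
    pfx = prefix.unsqueeze(0).expand(batch, -1, -1)                  # (B, P, d)
    inputs_embeds = torch.cat([pfx, token_embeds], dim=1)
    labels = torch.cat(                                              # ignore prefix positions
        [torch.full((batch, prefix_len), -100, dtype=torch.long), input_ids], dim=1
    )
    return model(inputs_embeds=inputs_embeds, labels=labels).loss

ids = tokenizer("translate to SQL: list all users", return_tensors="pt").input_ids
loss_with_prefix(ids).backward()                   # gradients flow only into `prefix`
print(prefix.grad is not None)                     # True
```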

Batch Prompting: Efficient Inference with Large Language Model APIs
    Zhoujun Cheng, Jungo Kasai, Tao Yu

    Computer Science

    EMNLP

  • 2023

Batch prompting, a simple yet effective prompting approach that enables the LLM to run inference in batches, instead of one sample at a time, is proposed, which reduces both token and time costs while retaining downstream performance.
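
A minimal sketch of the batching pattern is below: several samples are packed into one prompt so the instruction tokens are paid for once per batch, and the indexed answers are parsed back out. The "A[i]:" answer format is an illustrative assumption.

```python
# Sketch of batch prompting: one prompt carries several samples; answers are
# returned and parsed by index. The answer format is an assumption.
import re

def build_batch_prompt(instruction: str, samples: list[str]) -> str:
    numbered = "\n".join(f"Q[{i}]: {s}" for i, s in enumerate(samples))
    return (
        f"{instruction}\n{numbered}\n"
        "Answer every question, one line each, in the form A[i]: <answer>."
    )

def parse_batch_answers(response: str, n: int) -> list[str]:
    answers = dict(re.findall(r"A\[(\d+)\]:\s*(.+)", response))
    return [answers.get(str(i), "").strip() for i in range(n)]

samples = ["2 + 2", "7 * 6", "10 - 3"]
print(build_batch_prompt("Evaluate each expression.", samples))
print(parse_batch_answers("A[0]: 4\nA[1]: 42\nA[2]: 7", len(samples)))  # ['4', '42', '7']
```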

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
    Patrick Lewis, Ethan Perez, Douwe Kiela

    Computer Science

    NeurIPS

  • 2020

A general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) -- models which combine pre-trained parametric and non-parametric memory for language generation, and finds that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.
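
A minimal retrieve-then-generate sketch follows. TF-IDF retrieval stands in for the paper's dense (DPR) retriever and prompt construction stands in for the seq2seq generator; the tiny corpus is made up for illustration.

```python
# Minimal RAG-style pipeline sketch: retrieve top-k passages, then build a
# context-grounded prompt for a generator. TF-IDF replaces dense retrieval here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Jellyfish is an instruction-tuned large language model for data preprocessing.",
    "Entity matching decides whether two records refer to the same real-world entity.",
    "Error detection flags attribute values that are likely to be wrong.",
]

vectorizer = TfidfVectorizer().fit(corpus)
doc_vectors = vectorizer.transform(corpus)

def retrieve(query: str, k: int = 2) -> list[str]:
    scores = cosine_similarity(vectorizer.transform([query]), doc_vectors)[0]
    return [corpus[i] for i in scores.argsort()[::-1][:k]]

def build_rag_prompt(query: str) -> str:
    context = "\n".join(f"- {passage}" for passage in retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."

print(build_rag_prompt("What is entity matching?"))
```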

Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes
    Simran Arora, Brandon Yang, Christopher Ré

    Computer Science

    Proc. VLDB Endow.

  • 2023

This work proposes and evaluates Evaporate, a prototype system powered by large language models, and proposes an extended implementation, Evaporate-Code+, which achieves better quality than direct extraction.

TinyBERT: Distilling BERT for Natural Language Understanding
    Xiaoqi Jiao, Yichun Yin, Qun Liu

    Computer Science

    Findings of EMNLP

  • 2020

A novel Transformer distillation method that is specially designed for knowledge distillation (KD) of Transformer-based models is proposed and, by leveraging this new KD method, the plentiful knowledge encoded in a large “teacher” BERT can be effectively transferred to a small “student” TinyBERT.
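
A sketch of two standard distillation terms in this style is shown below: a soft cross-entropy between teacher and student logits with a temperature, plus an MSE between hidden states through a learned projection because the student is narrower. Random tensors stand in for real model outputs, and TinyBERT's additional attention-matrix and embedding losses are omitted.

```python
# Sketch of Transformer knowledge-distillation losses: softened logit matching
# plus hidden-state matching via a learned projection. Illustrative only.
import torch
from torch import nn
import torch.nn.functional as F

teacher_dim, student_dim, num_classes, T = 768, 312, 2, 2.0
proj = nn.Linear(student_dim, teacher_dim)     # maps student hidden size to teacher's

def distillation_loss(student_logits, teacher_logits, student_hidden, teacher_hidden):
    soft_ce = F.kl_div(                         # soften both distributions with temperature T
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hidden_mse = F.mse_loss(proj(student_hidden), teacher_hidden)
    return soft_ce + hidden_mse

loss = distillation_loss(
    torch.randn(8, num_classes), torch.randn(8, num_classes),
    torch.randn(8, 128, student_dim), torch.randn(8, 128, teacher_dim),
)
print(loss.item())
```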

Rotom: A Meta-Learned Data Augmentation Framework for Entity Matching, Data Cleaning, Text Classification, and Beyond
    Zhengjie Miao, Yuliang Li, Xiaolan Wang

    Computer Science

    SIGMOD Conference

  • 2021

Rotom is a multi-purpose data augmentation framework for a range of data management and mining tasks including entity matching, data cleaning, and text classification that automatically learns a policy for combining examples from different DA operators, thereby combinatorially reducing the hyper-parameter space.
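
Below is a sketch of two simple augmentation operators for serialized records and a policy that chains them. The operators and the uniform random policy are illustrative stand-ins; Rotom meta-learns how to select and combine operators rather than sampling them uniformly.

```python
# Two toy data-augmentation operators and a simple policy that combines them.
import random

def token_delete(text: str, p: float = 0.1) -> str:
    tokens = text.split()
    kept = [t for t in tokens if random.random() > p]
    return " ".join(kept) if kept else text

def token_swap(text: str) -> str:
    tokens = text.split()
    if len(tokens) > 1:
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return " ".join(tokens)

OPERATORS = [token_delete, token_swap]

def augment(text: str, num_ops: int = 2) -> str:
    for op in random.choices(OPERATORS, k=num_ops):   # a learned policy would pick these
        text = op(text)
    return text

random.seed(0)
print(augment("title: iPhone 13 128GB blue, price: 699"))
```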

LoRA: Low-Rank Adaptation of Large Language Models
    J. E. Hu, Yelong Shen, Weizhu Chen

    Computer Science

    ICLR

  • 2022

Low-Rank Adaptation, or LoRA, is proposed, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks.
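
A minimal LoRA-style linear layer is sketched below: the pretrained weight stays frozen and only the rank-r matrices A and B are trained, with B initialized to zero so training starts from the original behavior. This is a from-scratch sketch, not the peft library, and the rank and scaling values are assumptions.

```python
# Minimal LoRA-style linear layer: frozen base weight plus trainable low-rank update.
import torch
from torch import nn

class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad = False      # frozen pretrained weight
        self.base.bias.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(64, 64)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # only the low-rank matrices A and B are trainable
```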

The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
    S. Longpre, Le Hou, Adam Roberts

    Computer Science, Education

    ICML

  • 2023

It is found that task balancing and enrichment techniques are overlooked but critical to effective instruction tuning; in particular, training with mixed prompt settings yields stronger performance in all settings.
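
A sketch of what "mixed prompt settings" can look like in a training pipeline follows: each example is rendered sometimes zero-shot and sometimes few-shot with exemplars. The template wording and the 50/50 mixing ratio are illustrative assumptions.

```python
# Sketch of mixing zero-shot and few-shot templates when formatting training examples.
import random

def render_zero_shot(instruction: str, inp: str) -> str:
    return f"{instruction}\nInput: {inp}\nOutput:"

def render_few_shot(instruction: str, inp: str, exemplars: list[tuple[str, str]]) -> str:
    shots = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in exemplars)
    return f"{instruction}\n{shots}\nInput: {inp}\nOutput:"

def render_mixed(instruction: str, inp: str, exemplars, p_few_shot: float = 0.5) -> str:
    if exemplars and random.random() < p_few_shot:
        return render_few_shot(instruction, inp, exemplars)
    return render_zero_shot(instruction, inp)

random.seed(1)
exemplars = [("3 + 4", "7"), ("9 - 2", "7")]
print(render_mixed("Solve the arithmetic problem.", "5 + 6", exemplars))
```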

...

...
