Jellyfish: A Large Language Model for Data Preprocessing | Semantic Scholar (2024)

Figures and Tables from this paper: figure 1; tables 1–4, 6, 7, 9–12, and 15.

4 Citations

Entity Matching using Large Language Models
    R. Peeters, Christian Bizer

    Computer Science

    ArXiv

  • 2023

It is shown that for use cases that do not allow data to be shared with third parties, open-source LLMs can be a viable alternative to hosted LLMs given that a small amount of training data or matching knowledge is required.
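
As a rough illustration of the prompt-based entity matching studied in this citing paper, the sketch below builds a yes/no matching prompt for a pair of records and parses the model's answer. The record fields, answer format, and parsing rule are illustrative assumptions, not the paper's code; the LLM call itself is left out.

```python
# Minimal sketch of prompt-based entity matching with an (open-source) LLM.
# Record fields and the yes/no parsing rule are illustrative assumptions.

def build_match_prompt(record_a: dict, record_b: dict) -> str:
    serialize = lambda r: "; ".join(f"{k}: {v}" for k, v in r.items())
    return (
        "Do the following two product records refer to the same real-world entity?\n"
        f"Record A: {serialize(record_a)}\n"
        f"Record B: {serialize(record_b)}\n"
        "Answer with Yes or No."
    )

def parse_match_answer(answer: str) -> bool:
    return answer.strip().lower().startswith("yes")

prompt = build_match_prompt(
    {"title": "Apple iPhone 13 128GB Blue", "price": "699"},
    {"title": "iPhone 13 (128 GB) - blue", "price": "698.99"},
)
print(prompt)
print(parse_match_answer("Yes, both records describe the same phone."))  # True
```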

Large Language Models (LLMs) on Tabular Data: Prediction, Generation, and Understanding - A Survey
    Xi Fang, Weijie Xu, Christos Faloutsos

    Computer Science

    ArXiv

  • 2024

This survey aims to address the gap by consolidating recent progress in these areas, offering a thorough survey and taxonomy of the datasets, metrics, and methodologies utilized, while providing some insights for future research directions in this vital and rapidly evolving field.

Quantitative knowledge retrieval from large language models
    David Selby, Kai Spriestersbach, Sebastian Vollmer

    Computer Science, Linguistics

    ArXiv

  • 2024

The feasibility of LLMs as a mechanism for quantitative knowledge retrieval to aid data analysis tasks such as elicitation of prior distributions for Bayesian models and imputation of missing data is explored.
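
In the spirit of the quantitative knowledge retrieval idea above, a minimal imputation sketch might ask the model for a single numeric value for a missing cell and parse it; the column names, prompt wording, and regex-based parsing below are assumptions for illustration only.

```python
# Sketch of LLM-based imputation of a missing numeric cell. Column names and the
# parsing rule are illustrative assumptions; the LLM call itself is omitted.
import re

def build_imputation_prompt(row: dict, target_column: str) -> str:
    known = ", ".join(f"{k} = {v}" for k, v in row.items()
                      if v is not None and k != target_column)
    return (
        f"A record has the known attributes: {known}. "
        f"Give a single plausible numeric value for the missing attribute '{target_column}'. "
        "Reply with the number only."
    )

def parse_numeric_answer(answer: str) -> float | None:
    match = re.search(r"-?\d+(\.\d+)?", answer)
    return float(match.group()) if match else None

row = {"country": "Japan", "year": 2020, "life_expectancy": None}
print(build_imputation_prompt(row, "life_expectancy"))
print(parse_numeric_answer("Roughly 84.6 years"))  # 84.6
```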

Unicorn: A Unified Multi-Tasking Matching Model
    Ju Fan, Jianhong Tu, Nan Tang

    Computer Science

The proposed Unicorn, a unified model for generally supporting common data matching tasks, adopts a mixture-of-experts architecture that enhances the learned representations, and it achieves better performance on most tasks, and on average, than state-of-the-art task-specific models trained separately for individual tasks and datasets.
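
A toy mixture-of-experts layer of the kind Unicorn builds on is sketched below: a gate softly combines several expert transformations of a matching representation. The dimensions and the dense (soft) routing are assumptions, not the paper's exact architecture.

```python
# Toy mixture-of-experts layer in PyTorch: a gating network produces per-expert
# weights and the output is the weighted sum of all expert outputs.
import torch
from torch import nn

class MoELayer(nn.Module):
    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:            # x: (batch, dim)
        weights = torch.softmax(self.gate(x), dim=-1)               # (batch, num_experts)
        expert_out = torch.stack([e(x) for e in self.experts], 1)   # (batch, num_experts, dim)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)      # (batch, dim)

layer = MoELayer(dim=32)
print(layer(torch.randn(8, 32)).shape)  # torch.Size([8, 32])
```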

115 References

Large Language Models as Data Preprocessors
    Haochen Zhang, Yuyang Dong, Chuan Xiao, M. Oyamada

    Computer Science, Linguistics

    ArXiv

  • 2023

An LLM-based framework for data preprocessing is proposed, which integrates cutting-edge prompt engineering techniques, coupled with traditional methods like contextualization and feature selection, to improve the performance and efficiency of these models.
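
A rough sketch of such an instruction-style preprocessing prompt (here, error detection) with a crude form of feature selection is shown below: only a chosen subset of attributes is serialized into the context. The attribute names and instruction wording are assumptions, not the framework's actual prompts.

```python
# Sketch of an instruction-style data-preprocessing prompt (error detection),
# serializing only selected attributes into the context. Illustrative only.

def serialize_row(row: dict, selected_attributes: list[str]) -> str:
    return ", ".join(f'{k}: "{row[k]}"' for k in selected_attributes if k in row)

def build_error_detection_prompt(row: dict, selected_attributes: list[str], target: str) -> str:
    context = serialize_row(row, selected_attributes)
    return (
        f"Given the record [{context}], is the value of attribute "
        f"'{target}' erroneous? Answer with Yes or No."
    )

row = {"city": "Chicgo", "state": "IL", "zip": "60614", "notes": "long free text ..."}
print(build_error_detection_prompt(row, ["city", "state", "zip"], "city"))
```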

Platypus: Quick, Cheap, and Powerful Refinement of LLMs
    Ariel N. Lee, Cole J. Hunter, Nataniel Ruiz

    Computer Science

    ArXiv

  • 2023

The Platypus family achieves strong performance in quantitative LLM metrics across model sizes, topping the global Open LLM leaderboard while using just a fraction of the fine-tuning data and overall compute that are required for other state-of-the-art fine-tuned LLMs.

Prefix-Tuning: Optimizing Continuous Prompts for Generation
    Xiang Lisa Li, Percy Liang

    Computer Science

    ACL

  • 2021

Prefix-tuning is proposed, a lightweight alternative to fine-tuning for natural language generation tasks, which keeps language model parameters frozen and instead optimizes a sequence of continuous task-specific vectors, called the prefix.
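
A simplified sketch of the idea follows: the language model stays frozen and only a short sequence of continuous prefix vectors is trained. For brevity the prefix is prepended at the embedding layer (which is closer to prompt tuning); the paper's method injects prefix activations at every Transformer layer. The choice of GPT-2 and the prefix length are assumptions.

```python
# Simplified prefix-tuning sketch: frozen LM, trainable continuous prefix vectors.
import torch
from torch import nn
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
for p in model.parameters():
    p.requires_grad = False                        # keep all LM parameters frozen

prefix_len = 10
prefix = nn.Parameter(0.02 * torch.randn(prefix_len, model.config.n_embd))

def loss_with_prefix(input_ids: torch.Tensor) -> torch.Tensor:
    batch = input_ids.size(0)
    token_embeds = model.transformer.wte(input_ids)                  # (B, T, d)
    pfx = prefix.unsqueeze(0).expand(batch, -1, -1)                  # (B, P, d)
    inputs_embeds = torch.cat([pfx, token_embeds], dim=1)
    labels = torch.cat(                                              # ignore prefix positions
        [torch.full((batch, prefix_len), -100, dtype=torch.long), input_ids], dim=1
    )
    return model(inputs_embeds=inputs_embeds, labels=labels).loss

ids = tokenizer("translate to SQL: list all users", return_tensors="pt").input_ids
loss_with_prefix(ids).backward()                   # gradients flow only into `prefix`
print(prefix.grad is not None)                     # True
```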

Batch Prompting: Efficient Inference with Large Language Model APIs
    Zhoujun Cheng, Jungo Kasai, Tao Yu

    Computer Science

    EMNLP

  • 2023

Batch prompting, a simple yet effective prompting approach that enables the LLM to run inference in batches, instead of one sample at a time, is proposed, which reduces both token and time costs while retaining downstream performance.
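
A minimal sketch of the batching pattern is below: several samples are packed into one prompt so the instruction tokens are paid for once per batch, and the indexed answers are parsed back out. The "A[i]:" answer format is an illustrative assumption.

```python
# Sketch of batch prompting: one prompt carries several samples; answers are
# returned and parsed by index. The answer format is an assumption.
import re

def build_batch_prompt(instruction: str, samples: list[str]) -> str:
    numbered = "\n".join(f"Q[{i}]: {s}" for i, s in enumerate(samples))
    return (
        f"{instruction}\n{numbered}\n"
        "Answer every question, one line each, in the form A[i]: <answer>."
    )

def parse_batch_answers(response: str, n: int) -> list[str]:
    answers = dict(re.findall(r"A\[(\d+)\]:\s*(.+)", response))
    return [answers.get(str(i), "").strip() for i in range(n)]

samples = ["2 + 2", "7 * 6", "10 - 3"]
print(build_batch_prompt("Evaluate each expression.", samples))
print(parse_batch_answers("A[0]: 4\nA[1]: 42\nA[2]: 7", len(samples)))  # ['4', '42', '7']
```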

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
    Patrick Lewis, Ethan Perez, Douwe Kiela

    Computer Science

    NeurIPS

  • 2020

A general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) -- models which combine pre-trained parametric and non-parametric memory for language generation, and finds that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.
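
A minimal retrieve-then-generate sketch follows. TF-IDF retrieval stands in for the paper's dense (DPR) retriever and prompt construction stands in for the seq2seq generator; the tiny corpus is made up for illustration.

```python
# Minimal RAG-style pipeline sketch: retrieve top-k passages, then build a
# context-grounded prompt for a generator. TF-IDF replaces dense retrieval here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Jellyfish is an instruction-tuned large language model for data preprocessing.",
    "Entity matching decides whether two records refer to the same real-world entity.",
    "Error detection flags attribute values that are likely to be wrong.",
]

vectorizer = TfidfVectorizer().fit(corpus)
doc_vectors = vectorizer.transform(corpus)

def retrieve(query: str, k: int = 2) -> list[str]:
    scores = cosine_similarity(vectorizer.transform([query]), doc_vectors)[0]
    return [corpus[i] for i in scores.argsort()[::-1][:k]]

def build_rag_prompt(query: str) -> str:
    context = "\n".join(f"- {passage}" for passage in retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."

print(build_rag_prompt("What is entity matching?"))
```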

Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes
    Simran Arora, Brandon Yang, Christopher Ré

    Computer Science

    Proc. VLDB Endow.

  • 2023

This work proposes and evaluates Evaporate, a prototype system powered by large language models, and proposes an extended implementation, Evaporate-Code+, which achieves better quality than direct extraction.

TinyBERT: Distilling BERT for Natural Language Understanding
    Xiaoqi Jiao, Yichun Yin, Qun Liu

    Computer Science

    Findings of EMNLP

  • 2020

A novel Transformer distillation method that is specially designed for knowledge distillation (KD) of Transformer-based models is proposed and, by leveraging this new KD method, the plentiful knowledge encoded in a large “teacher” BERT can be effectively transferred to a small “student” TinyBERT.
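
A sketch of two standard distillation terms in this style is shown below: a soft cross-entropy between teacher and student logits with a temperature, plus an MSE between hidden states through a learned projection because the student is narrower. Random tensors stand in for real model outputs, and TinyBERT's additional attention-matrix and embedding losses are omitted.

```python
# Sketch of Transformer knowledge-distillation losses: softened logit matching
# plus hidden-state matching via a learned projection. Illustrative only.
import torch
from torch import nn
import torch.nn.functional as F

teacher_dim, student_dim, num_classes, T = 768, 312, 2, 2.0
proj = nn.Linear(student_dim, teacher_dim)     # maps student hidden size to teacher's

def distillation_loss(student_logits, teacher_logits, student_hidden, teacher_hidden):
    soft_ce = F.kl_div(                         # soften both distributions with temperature T
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hidden_mse = F.mse_loss(proj(student_hidden), teacher_hidden)
    return soft_ce + hidden_mse

loss = distillation_loss(
    torch.randn(8, num_classes), torch.randn(8, num_classes),
    torch.randn(8, 128, student_dim), torch.randn(8, 128, teacher_dim),
)
print(loss.item())
```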

Rotom: A Meta-Learned Data Augmentation Framework for Entity Matching, Data Cleaning, Text Classification, and Beyond
    Zhengjie Miao, Yuliang Li, Xiaolan Wang

    Computer Science

    SIGMOD Conference

  • 2021

Rotom is a multi-purpose data augmentation framework for a range of data management and mining tasks including entity matching, data cleaning, and text classification that automatically learns a policy for combining examples from different DA operators, thereby combinatorially reducing the hyper-parameter space.
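
Below is a sketch of two simple augmentation operators for serialized records and a policy that chains them. The operators and the uniform random policy are illustrative stand-ins; Rotom meta-learns how to select and combine operators rather than sampling them uniformly.

```python
# Two toy data-augmentation operators and a simple policy that combines them.
import random

def token_delete(text: str, p: float = 0.1) -> str:
    tokens = text.split()
    kept = [t for t in tokens if random.random() > p]
    return " ".join(kept) if kept else text

def token_swap(text: str) -> str:
    tokens = text.split()
    if len(tokens) > 1:
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return " ".join(tokens)

OPERATORS = [token_delete, token_swap]

def augment(text: str, num_ops: int = 2) -> str:
    for op in random.choices(OPERATORS, k=num_ops):   # a learned policy would pick these
        text = op(text)
    return text

random.seed(0)
print(augment("title: iPhone 13 128GB blue, price: 699"))
```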

LoRA: Low-Rank Adaptation of Large Language Models
    J. E. Hu, Yelong Shen, Weizhu Chen

    Computer Science

    ICLR

  • 2022

Low-Rank Adaptation, or LoRA, is proposed, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks.
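
A minimal LoRA-style linear layer is sketched below: the pretrained weight stays frozen and only the rank-r matrices A and B are trained, with B initialized to zero so training starts from the original behavior. This is a from-scratch sketch, not the peft library, and the rank and scaling values are assumptions.

```python
# Minimal LoRA-style linear layer: frozen base weight plus trainable low-rank update.
import torch
from torch import nn

class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad = False      # frozen pretrained weight
        self.base.bias.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(64, 64)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # only the low-rank matrices A and B are trainable
```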

The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
    S. Longpre, Le Hou, Adam Roberts

    Computer Science, Education

    ICML

  • 2023

It is found that task balancing and enrichment techniques are overlooked but critical to effective instruction tuning; in particular, training with mixed prompt settings yields stronger performance in all settings.
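
A sketch of what "mixed prompt settings" can look like in a training pipeline follows: each example is rendered sometimes zero-shot and sometimes few-shot with exemplars. The template wording and the 50/50 mixing ratio are illustrative assumptions.

```python
# Sketch of mixing zero-shot and few-shot templates when formatting training examples.
import random

def render_zero_shot(instruction: str, inp: str) -> str:
    return f"{instruction}\nInput: {inp}\nOutput:"

def render_few_shot(instruction: str, inp: str, exemplars: list[tuple[str, str]]) -> str:
    shots = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in exemplars)
    return f"{instruction}\n{shots}\nInput: {inp}\nOutput:"

def render_mixed(instruction: str, inp: str, exemplars, p_few_shot: float = 0.5) -> str:
    if exemplars and random.random() < p_few_shot:
        return render_few_shot(instruction, inp, exemplars)
    return render_zero_shot(instruction, inp)

random.seed(1)
exemplars = [("3 + 4", "7"), ("9 - 2", "7")]
print(render_mixed("Solve the arithmetic problem.", "5 + 6", exemplars))
```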

...

...
