GitHub - MaartenGr/KeyBERT: Minimal keyword extraction with BERT
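KeyBERT exposes a very small API for BERT-based keyword extraction. A minimal sketch, assuming the package is installed (`pip install keybert`); the sample document and parameter values below are illustrative, not taken from the repository:

```python
# Minimal keyword extraction with KeyBERT (sketch; assumes `pip install keybert`).
from keybert import KeyBERT

doc = (
    "Supervised learning is the machine learning task of learning a function "
    "that maps an input to an output based on example input-output pairs."
)

kw_model = KeyBERT()  # loads a small sentence-transformers model under the hood
keywords = kw_model.extract_keywords(
    doc,
    keyphrase_ngram_range=(1, 2),  # consider unigrams and bigrams
    stop_words="english",
    top_n=5,
)
print(keywords)  # list of (phrase, similarity score) tuples
```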
GitHub - arthur-ai/bench: A tool for evaluating LLMs
BA Builder added
Text embeddings are a critical piece of many pipelines, from search, to RAG, to vector databases and more. Most embedding models are BERT/Transformer-based and typically have short context lengths (e.g., 512 tokens). That’s only about two pages of text, but documents can be very long – books, legal cases, TV screenplays, code repositories, etc. can be tens... See more
Long-Context Retrieval Models with Monarch Mixer
Nicolay Gerold added
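Long-context retrieval models like the M2-BERT embedders discussed above lift the ~512-token ceiling; with ordinary short-context embedders, the usual workaround is to chunk the document and embed each chunk. A minimal sketch of that pattern, assuming `sentence-transformers` is installed; the model name and chunk sizes are illustrative choices, not from the source:

```python
# Sketch: chunk a long document to respect a ~512-token context window, then embed each chunk.
# Assumes `pip install sentence-transformers`; the model name is an illustrative choice.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # typical short-context (<=512 token) embedder

def chunk_words(text: str, max_words: int = 200, overlap: int = 20):
    """Naive word-based chunking with overlap; real pipelines usually chunk by tokens."""
    words = text.split()
    step = max_words - overlap
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), step)]

long_document = "..."  # a book, legal case, screenplay, etc.
chunks = chunk_words(long_document)
embeddings = model.encode(chunks)  # one vector per chunk
print(embeddings.shape)            # (num_chunks, embedding_dim)
```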
1. Introduction of DeepSeek Coder
DeepSeek Coder is composed of a series of code language models, each trained from scratch on 2T tokens, with a composition of 87% code and 13% natural language in both English and Chinese. We provide various sizes of the code model, ranging from 1B to 33B versions. Each model is pre-trained on project-level code co... See more
deepseek-ai • GitHub - deepseek-ai/DeepSeek-Coder: DeepSeek Coder: Let the Code Write Itself
Nicolay Gerold added
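Since the DeepSeek Coder checkpoints are released as ordinary causal language models, they can be driven through the standard Hugging Face transformers API. A minimal sketch; the model ID and generation settings are assumptions (pick whichever of the 1B–33B sizes fits your hardware):

```python
# Sketch: code completion with a DeepSeek Coder checkpoint via Hugging Face transformers.
# The model ID below is an assumption; swap in the size you actually want to run.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/deepseek-coder-1.3b-base"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

prompt = "# write a quicksort function in python\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```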
Pretrained model on protein sequences using a masked language modeling (MLM) objective. It was introduced in this paper and first released in this repository. This model is trained on uppercase amino acids: it only works with capital letter amino acids.
Model description
ProtBert is based on the BERT model, which was pretrained on a large corpus of protein sequ... See more
Rostlab/prot_bert · Hugging Face
Nicolay Gerold added
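Because ProtBert is trained with an MLM objective, the natural way to probe it is masked amino-acid prediction. A minimal sketch using the transformers fill-mask pipeline with the Rostlab/prot_bert checkpoint named above; the sample sequence is illustrative, and note that the model expects uppercase amino acids separated by spaces:

```python
# Sketch: masked amino-acid prediction with ProtBert via the transformers fill-mask pipeline.
# The model expects uppercase amino acids, one letter per token, separated by spaces.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="Rostlab/prot_bert")
sequence = "D L I P T S S K L V V [MASK] D T S L Q V K K A F F A L V T"
for prediction in unmasker(sequence):
    print(prediction["token_str"], round(prediction["score"], 3))
```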
I used this system to design self-documenting components that were easy to develop against, and allowed the rest of the team to build new and consistent UI without needing me.
Pirijan Ketheswaran • Redesigning an App, One Day a Week at a Time
gabriel added
- Leveraging Document and Corpus Structure
- Scaling to Multiple Languages
Donald Metzler • Rethinking Search: Making Domain Experts out of Dilettantes
Benjamin Searle added
ColBERT is a fast and accurate retrieval model, enabling scalable BERT-based search over large text collections in tens of milliseconds.
Figure 1: ColBERT's late interaction, efficiently scoring the fine-grained similarity between a query and a passage.
As Figure 1 illustrates, ColBERT relies on fine-grained contextual late interaction: it encod... See more
stanford-futuredata • GitHub - stanford-futuredata/ColBERT: Stanford ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22)
Nicolay Gerold added
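The late interaction in Figure 1 reduces to a MaxSim step: each query token embedding is matched against its most similar passage token embedding, and the per-token maxima are summed into the relevance score. A minimal PyTorch sketch of just that scoring step, with random tensors standing in for the per-token BERT embeddings the real ColBERT encoders produce:

```python
# Sketch of ColBERT-style late interaction (MaxSim) scoring.
# Random tensors stand in for the per-token BERT embeddings the real model produces.
import torch
import torch.nn.functional as F

def maxsim_score(query_emb: torch.Tensor, passage_emb: torch.Tensor) -> torch.Tensor:
    """query_emb: (num_query_tokens, dim), passage_emb: (num_passage_tokens, dim)."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    sim = q @ p.T                        # (num_query_tokens, num_passage_tokens) cosine sims
    return sim.max(dim=1).values.sum()   # best passage match per query token, summed

query = torch.randn(32, 128)    # e.g., 32 query tokens, 128-dim token embeddings
passage = torch.randn(180, 128)
print(maxsim_score(query, passage).item())
```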
Methods of fine-tuning an open-source LLM exist ↓
- Continued pre-training: utilize domain-specific data to apply the same pre-training process (next token prediction) on the pre-trained (base) model
- Instruction fine-tuning: the pre-trained (base) model is fine-tuned on ... See more
Shortwave — rajhesh.panchanadhan@gmail.com [Gmail alternative]
Nicolay Gerold added
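Both approaches listed above use the same next-token-prediction training loop; they differ only in the data (raw domain text vs. instruction/response pairs). A minimal instruction fine-tuning sketch with Hugging Face transformers; the base model, prompt template, and hyperparameters are illustrative assumptions, and for continued pre-training you would feed raw domain text instead of the formatted instruction pairs:

```python
# Sketch: instruction fine-tuning a small open-source causal LM with Hugging Face transformers.
# Model name, prompt template, and hyperparameters are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from datasets import Dataset

model_name = "gpt2"  # stand-in; swap for the open-source LLM you are fine-tuning
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tiny instruction dataset, formatted into a single text field (prompt + response).
examples = [
    {"instruction": "Summarize: The cat sat on the mat.", "response": "A cat sat on a mat."},
    {"instruction": "Translate to French: Good morning.", "response": "Bonjour."},
]

def format_example(ex):
    text = f"### Instruction:\n{ex['instruction']}\n\n### Response:\n{ex['response']}"
    tokens = tokenizer(text, truncation=True, max_length=512, padding="max_length")
    tokens["labels"] = tokens["input_ids"].copy()  # causal LM: targets are the inputs shifted
    return tokens

dataset = Dataset.from_list(examples).map(
    format_example, remove_columns=["instruction", "response"]
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="sft-out",
        per_device_train_batch_size=2,
        num_train_epochs=1,
        logging_steps=1,
    ),
    train_dataset=dataset,
)
trainer.train()
```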