Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models


Luke Merrick, Danmei Xu, Gaurav Nuti, and Daniel Campos (Snowflake Inc.). Corresponding author: Luke Merrick, luke.merrick@snowflake.com
Abstract

This report describes the training dataset creation and recipe behind the family of arctic-embed text embedding models (a set of five models ranging from 22 to 334 million parameters with weights open-sourced under an Apache-2 license). At the time of their release, each model achieved state-of-the-art retrieval accuracy for models of its size on the MTEB Retrieval leaderboard (https://huggingface.co/spaces/mteb/leaderboard), with the largest model, arctic-embed-l, outperforming closed-source embedding models such as Cohere’s embed-v3 and OpenAI’s text-embedding-3-large. In addition to the details of our training recipe, we provide several informative ablation studies, which we believe help explain our models’ performance.

1 Introduction

Embedding models’ ability to provide accurate retrieval performance without additional tuning Lewis et al. (2020) has made them a popular choice in search and retrieval-augmented generation (RAG) Ram et al. (2023) workloads.
Unlike traditional keyword search, embedding models encode information beyond token overlap. This allows embedding systems to place queries like “How tall is Tom Cruise?” and “Height of the actor who plays Maverick in Top Gun” close together despite their having no words in common.

Figure 1: Snowflake’s Arctic-embed models are a suite of 5 embedding models, each of which pushes the Pareto frontier in the trade-off between model size and retrieval performance on the MTEB Retrieval Leaderboard.

Driven by the utility and widespread adoption of these models, the broader open-source and research community has put forth a constant stream of ever-stronger text embedding models such as E5 Wang et al. (2022), GTE Li et al. (2023a), and Jina Günther et al. (2024). The quick experimentation and improvement underpinning these works is, in turn, enabled in part by large-scale open evaluation benchmarks such as MSMARCO Campos et al. (2016), BEIR Thakur et al. (2021), and MTEB Muennighoff et al. (2023). These benchmarks combine easy, efficient evaluation with a broad array of tasks, which allows for effective experimentation.

This paper’s work was motivated in early 2024 by the lack of efficient and effective open text embedding models competing with the performance of closed-source models such as Cohere’s embed-v3 or OpenAI’s text-embedding-3-large. While models such as SFR-Embedding-Mistral Yavuz (2024) and GritLM Muennighoff et al. (2024) outscore proprietary offerings, their size (each over 7 billion parameters) and their embedding dimensionality (4096) make them impractical for many production workloads. Seeking to provide high-quality retrieval models with fewer than a billion parameters, we set out to train a suite of efficient embedding models.

Through fruitful data-centric experiments, we developed the recently released Arctic family of text embedding models. Based on five encoder-only pretrained language models of various sizes (see Table 1) and leveraging the same training data and methodology, we trained each model to optimize retrieval performance as measured by nDCG@10 on the MTEB Retrieval leaderboard. As shown in Figure 1, each variant achieved a new state-of-the-art performance for its size (as of April 16th, 2024). We present these models and this technical report as a journal of the experiments that led to these improvements in performance.

1.1 Summary of Contributions

Open model release. We release a suite of embedding models, Arctic-embed, under a permissive Apache-2 license, which deliver state-of-the-art retrieval performance for their size and context-window class on the Retrieval portion of the MTEB leaderboard.

Demonstrated importance of data organization. We present a set of ablations suggesting that improvements in retrieval quality are more strongly tied to data sampling during training and the method of negative mining than to scaling up dataset size and batch size, the focus of much previous work.

Improved methods for synthetic data. We present a novel technique for query generation grounded by mined hard negatives, which we found more effective than straightforward approaches that generate both queries and negatives, and which served as a key ingredient in our models’ success.

Size Base Model (Huggingface ID) Parameters (M) Embedding Dimension
xs nreimers/MiniLM-L6-H384-uncased 23 384
s intfloat/e5-unsupervised-small 33 384
m intfloat/e5-unsupervised-base 110 768
m-long nomic-ai/nomic-embed-text-v1-unsupervised 137 768
l intfloat/e5-unsupervised-large 334 1024
Table 1: Breakdown of model architectures.

2 Background

2.1 Task Description

An embedding model maps a variable-length input into a fixed-dimensional vector. This one-way transform can be applied to various modalities, scales, and scopes, and the resulting vectors can be used directly for downstream tasks such as classification, clustering, or retrieval. In the scope of our work, we focus on text embeddings for retrieval. Given a query and a document collection, this task aims to train a model that maximizes the similarity between the query and relevant documents while minimizing its similarity to irrelevant documents.

Representation-based retrieval has emerged as a standard paradigm because it minimizes how often inputs must be transformed into vectors. Offline, the document corpus is processed into a set of vectors stored in an approximate nearest neighbor index such as FAISS Douze et al. (2024). At query time, the input is transformed into a vector online, and the documents with the closest embeddings are retrieved. In other words, the cosine similarity between query and document embeddings signals relevance.
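To make this concrete, here is a minimal sketch of representation-based retrieval using FAISS. The embed function is a stand-in for an embedding model (it returns random unit vectors purely so the sketch runs), and the inner-product index is equivalent to cosine similarity once vectors are L2-normalized.

    import faiss
    import numpy as np

    def embed(texts):
        # Stand-in for a real embedding model: random unit vectors so the sketch runs.
        rng = np.random.default_rng(0)
        vecs = rng.normal(size=(len(texts), 8)).astype("float32")
        return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

    # Offline: embed and index the document corpus.
    documents = ["Tom Cruise filmography.", "Top Gun: Maverick premiered in 2022."]
    doc_vectors = embed(documents)
    index = faiss.IndexFlatIP(doc_vectors.shape[1])  # inner product == cosine on unit vectors
    index.add(doc_vectors)

    # Online: embed the query and retrieve the closest documents.
    query_vector = embed(["How tall is Tom Cruise?"])
    scores, ids = index.search(query_vector, 1)
    print(documents[ids[0][0]], float(scores[0][0]))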

2.2 Related Work

Training Approaches: Building on prior NLP and IR research, embedding models are typically trained with supervised learning on positive and negative query-document examples, commonly extracted from weak signals or labeled data Lin et al. (2020), which fine-tune general-purpose pre-trained language models into specialized text embedding models. In this paradigm, Qu et al. (2021) demonstrated the importance of scaling up the batch size and training on hard negatives, while Xiong et al. (2020) demonstrated the importance of adapting the negatives’ difficulty to the retriever’s competence.

While earlier work focused on leveraging supervised datasets such as HotpotQA Yang et al. (2018) or NQ Kwiatkowski et al. (2019), Wang et al. (2022) demonstrated the effectiveness of constructing large datasets from web-crawled title-document examples through the groundbreaking performance of their resulting E5 model. Xiao et al. (2023a) and Nussbaum et al. (2024) combine generated datasets with supervised labeled datasets to improve retrieval performance further.

Model Architecture: Building on the success and utility of the transformer Vaswani et al. (2017), prior work has primarily focused on training models using BERT Devlin et al. (2018) or variants thereof. While some work has studied the use of sequence-to-sequence Zhuang et al. (2022) or large decoder-only models Yavuz (2024); Muennighoff et al. (2024), their larger size and correspondingly worse inference efficiency have kept the majority of focus on encoder-only variants.

Training Objective: Many works initially trained retrievers and rankers with traditional loss functions such as mean squared error Lin et al. (2020). More recently, however, contrastive losses Hadsell et al. (2006); Mueller and Thyagarajan (2016), which leverage not only positive pairs but the relationship between positive and negative pairs, have risen to prominence. InfoNCE (Noise Contrastive Estimation) van den Oord et al. (2018) improved on the contrastive triplet loss and has quickly become one of the most popular and common losses used to train embedding models.
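As a reference point, the sketch below shows the InfoNCE objective for a single query with one positive and several explicitly provided negative documents in PyTorch; the temperature value is an illustrative assumption, not a setting from this report.

    import torch
    import torch.nn.functional as F

    def info_nce(query, positive, negatives, temperature=0.02):
        # query: (dim,), positive: (dim,), negatives: (num_neg, dim); all L2-normalized.
        candidates = torch.cat([positive.unsqueeze(0), negatives], dim=0)  # (1 + num_neg, dim)
        logits = candidates @ query / temperature                          # scaled cosine similarities
        target = torch.tensor([0])                                         # the positive sits at index 0
        return F.cross_entropy(logits.unsqueeze(0), target)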

3 Arctic Embed

Difference Description Ablation Study
Better data We leverage web search data and common web data filtering methods. Shows Improvement
Source stratification As discussed further in Section 4.4, we fill each minibatch of training data with examples from a single data source. Nomic also uses this technique. Shows Improvement
Longer pretraining sequence length We used a maximum sequence length of 256, which is twice as long as in GTE and BGE Li et al. (2023b); Xiao et al. (2023b), despite using approximately the same batch size as these works. Shows Improvement
Base models trained for retrieval Several sizes of Arctic-embed use pre-trained text embedding models as starting points. For example, we use e5-unsupervised-base instead of its parent model bert-base-uncased as the backbone of arctic-embed-m. Shows Inconsistent Improvement
[CLS] embeddings Unlike most prior art, Arctic embed uses [CLS] token embeddings instead of mean pooling. BGE also uses the technique. Not Studied
Implementation and tuning We iterated relentlessly on the data mix, negative mining strategy, batch size, and other training parameters to double down on our data advantages in both training rounds. Additionally, our training implementation includes several attention-to-detail tricks (e.g., in-batch document deduplication), which may improve performance compared to more naive implementations. Not Studied
Table 2: Differences from prior works hypothesized to help Arctic embed score higher on MTEB Retrieval.

With Arctic-embed, we aimed to start from the current consensus of best practices from the literature and train an embedding model from the ground up.

Consistent with prior works like E5, BGE, GTE, Jina, and Nomic Wang et al. (2024); Xiao et al. (2023b); Li et al. (2023b); Günther et al. (2024); Nussbaum et al. (2024), we conduct two training rounds using two different kinds of datasets. The initial training round is large-scale pretraining using only in-batch negative examples. This round of training leverages a dataset of pairs of queries and relevant documents. The second round of training (often referred to as the fine-tuning step) calls for similar pairs of queries and documents augmented with an additional set of “hard” negative documents (where “hard” refers to the fact that it is not trivial to determine their lower relevance relative to the labeled-as-relevant document). We used a tunable negative mining strategy (see Section 3.6) to construct a focused dataset of about a million samples for this round of training.

Although our work closely replicates many of the steps prior works took, our resulting models score higher on the MTEB Retrieval benchmark, sometimes by a substantial margin. In Table 2 we present several hypotheses about what led to this improved performance, and in Section 7 we test several of these hypotheses through ablation studies.

3.1 Model Architecture

We trained models of varying sizes from BERT-like backbones, as shown in Table 1. Our m and l variants use the standard BERT architecture Devlin et al. (2019) (BERT base and large, respectively). We looked to variants of the MiniLMv2 architecture Wang et al. (2021) for our smaller sizes (xs and s), and we opted for the Nomic BERT architecture Nussbaum et al. (2024) for our long-context variant (m-long).

3.2 Pooling

Architecturally, we do not modify the base models at all, forgoing even the common practice of adding a pooling layer (e.g., we use AutoModel.from_pretrained(…, add_pooling_layer=False) in the transformers Python package). Additionally, instead of mean-pooling the output vectors, we utilize the final hidden state of the [CLS] token as the embedding vector, in contrast to the mean pooling strategy used in E5, GTE, and Nomic Wang et al. (2024); Li et al. (2023b); Nussbaum et al. (2024). This choice matches the BGE architecture Xiao et al. (2023b) and is inspired by the ablation study in Li and Li (2023), which showed this led to a 2.5% higher score on the Semantic Text Similarity (STS) evaluation studied.
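The footnoted transformers usage amounts to the following sketch of [CLS] pooling. The checkpoint name is shown only as an example, and the final L2 normalization is an assumption made because embeddings are compared by cosine similarity.

    import torch
    from transformers import AutoModel, AutoTokenizer

    model_name = "Snowflake/snowflake-arctic-embed-m"  # example checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, add_pooling_layer=False)

    inputs = tokenizer(["How tall is Tom Cruise?"], padding=True, return_tensors="pt")
    with torch.no_grad():
        last_hidden = model(**inputs).last_hidden_state   # (batch, seq_len, hidden)
    embedding = last_hidden[:, 0]                          # final hidden state of the [CLS] token
    embedding = torch.nn.functional.normalize(embedding, p=2, dim=1)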

3.3 Training Data

In creating our training datasets, we took inspiration from the world of Large Language Models (LLMs) and leveraged filtering methods inspired by RefinedWeb Penedo et al. (2023), C4 Raffel et al. (2023), Gopher Rae et al. (2022), and TogetherAI Computer (2023).

First, for noisy raw data sources such as web search, we parse structured web documents using trafilatura (https://trafilatura.readthedocs.io/en/latest/). While parsing, we compute custom signals for quality filtering. Specifically, for positive data pair cleaning, we need to ensure that: a) each text in the pair is of good quality (language filter, text quality filter), and b) the texts in a (query, document) pair are similar in meaning (consistency filter). For quality filtering, we leverage a series of filters similar to ones detailed in Snowflake’s Arctic model training cookbook (https://medium.com/snowflake/snowflake-arctic-cookbook-series-arctics-approach-to-data-b81a8a0958bd). A complete list of effective filtering methods can be found in Appendix C. We combine these filters to create a more curated dataset, removing low-quality, irrelevant, or potentially spam documents based on various characteristics related to content quality, language structure, and duplication.

For consistency filtering, we apply a low-fidelity, high-throughput pair-similarity consistency filter: sentence similarity using a fastText (https://fasttext.cc/docs/en/english-vectors.html) word2vec model, which can be run cheaply on CPU. Rather than treating these embeddings’ similarity as a reliable quality label, we instead adopt a conservative threshold (a low minimum allowed similarity of 0.3) and use them only to filter out clearly unrelated examples (e.g., “CGplayer doesn’t work properly without JavaScript-enabled” documents from web crawl failures). Additionally, we truncate long sequences to 512 words during this step, since we observed that queries in the web-based corpus were usually answered at the beginning of the document. Processing full documents was not only computationally wasteful, but the meaning captured in averaged word2vec embeddings would also get diluted by irrelevant words appearing later in the document.
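A minimal sketch of this consistency filter is shown below; the 512-word truncation and the 0.3 similarity floor follow the description above, while the vector file and loading API (gensim’s KeyedVectors over fastText English vectors) are illustrative assumptions.

    import numpy as np
    from gensim.models import KeyedVectors

    # fastText English vectors in word2vec text format (assumed file name).
    word_vectors = KeyedVectors.load_word2vec_format("wiki-news-300d-1M.vec")

    def mean_vector(text, max_words=512):
        # Average static word vectors over at most the first 512 words.
        words = [w for w in text.lower().split()[:max_words] if w in word_vectors]
        if not words:
            return np.zeros(word_vectors.vector_size)
        return np.mean([word_vectors[w] for w in words], axis=0)

    def is_consistent(query, document, min_similarity=0.3):
        q, d = mean_vector(query), mean_vector(document)
        denom = np.linalg.norm(q) * np.linalg.norm(d)
        return denom > 0 and float(q @ d / denom) >= min_similarity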

3.4 Dataset Mix And Sampling

Due to the different datasets’ sizes, consistency, hardness, and learning dynamics, simply concatenating all available datasets together proved a suboptimal strategy, especially in the fine-tuning stage. Instead, we ran isolated experiments to understand the effects of each dataset on fine-tuned performance. Then, we selected and combined datasets based on their relative performance in these experiments. Each data source we used is described in more depth below.

Figure 2: Composition of our pretraining dataset containing about 300 million query-document pairs.
Figure 3: Composition of our finetuning dataset, which contains about 1 million queries, each paired with a positive document and a set of hard negative documents.

Our large pretraining dataset, described in Figure 2, amounts to 308 million query-document pairs (filtered from around 2 billion documents), of which 71% are web search documents paired with either a query or a title. Aside from web search data, the text pairs include PAQ (https://github.com/facebookresearch/PAQ), StackExchange title-body pairs, title-body web document pairs from Common Crawl-based sources, and S2ORC title-abstract pairs (https://github.com/allenai/s2orc). We found the quality annotation and filtering steps described above transformative in improving quality and pruning noise, both in the web search data and in the other pairwise positive datasets.

Our fine-tuning dataset, described in Figure 3, consists of around 1 million pairs built by combining our web search data with several public datasets (HotpotQA, https://github.com/hotpotqa/hotpot; NQ, https://github.com/google-research-datasets/natural-questions; FEVER, https://fever.ai/dataset/fever.html; and StackExchange title-body, https://huggingface.co/datasets/sentence-transformers/embedding-training-data), then expanding it further via the synthetic data generation and negative mining strategies detailed in the sections below. This mix notably omits several popular public datasets used by other embedding models, based on our observations of their positive-pair consistency and the hardness of their negatives. These found-to-be-less-useful datasets include NLI, MEDI, WikiAnswers, and SQuAD. Empirically, we observed that quantity is less important than quality in the fine-tuning phase, and an overpowering amount of low-quality data can lead to lower-quality models.

3.5 Synthetic Data For Semantic Dense Mining

Compared to the abundance of web-scale data used in pretraining, high-quality examples suitable for finetuning are more scarce. To address this data scarcity, we used synthetic data creation to construct additional datasets that benefited downstream performance just as much as those listed above. Similar to the prior work of Dai et al. (2022); Lee et al. (2024), we leverage Large Language Models to generate novel queries. Breaking from these previous approaches, however, we found it critical to add negative documents to our LLM inputs to ground the query generation (see Algorithm 2 in the Appendix for details). Additionally, we chose to generate only synthetic queries rather than synthetic negatives because we found that LLMs do not easily generate relevant negatives of as high quality as those mined from a preexisting corpus of documents. Figure 4 shows this approach in action – two datasets generated by variants of Algorithm 2 led to score increases approaching that afforded by the original HotpotQA.

Figure 4: Comparison of model performance when training using synthetic datasets generated from the HotpotQA document corpus versus the original HotpotQA queries (using Algorithm 1 for negative mining in both cases).

3.6 Tunable Hard Negative Mining

Fine-tuning datasets typically pair each relevant query-document pair with carefully chosen “hard” negative examples. How hard should these negatives be for maximally effective learning in the fine-tuning phase? Our answer to this question was ultimately a tunable hard negative mining strategy in which we leveraged a preexisting text embedding model to identify and score the hardest negatives for each training example. We then applied an upper score threshold to discard the hardest of these candidates (those so relevant that they risk being unlabeled positives). We found that using an upper threshold rather than a specific rank helped account for the fact that some queries admit much harder top-k negatives than others, and in Section 7.2 we perform a parameter sweep of the negative hardness threshold to demonstrate the value of a tunable approach (the optimal threshold value scores significantly better than other choices). We additionally note that although Algorithm 1 specifies both an upper and a lower relevance threshold for negative mining, in practice we retrieved the top 100 hardest negatives and applied only an upper threshold as a performance optimization.
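A simplified sketch of this mining procedure (the practical variant of Algorithm 1, with only an upper threshold) follows; the threshold and count values here are illustrative, not our tuned settings.

    import numpy as np

    def mine_hard_negatives(query_vec, doc_vecs, positive_idx,
                            max_relevance=0.8, top_k=100, num_negatives=10):
        # All vectors are assumed L2-normalized, so dot products are cosine similarities.
        scores = doc_vecs @ query_vec
        scores[positive_idx] = -np.inf                 # never mine the labeled positive
        candidates = np.argsort(-scores)[:top_k]       # hardest (most similar) candidates first
        kept = [int(i) for i in candidates if scores[i] <= max_relevance]
        return kept[:num_negatives]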

Refer to caption
Figure 5: Impact of curriculum learning on nDCG@10 retrieval score during training. Scores averaged over a handful of small datasets used for fast in-training evaluation.

Beyond tuning a single hardness threshold, we hypothesized that ordering the data by the difficulty of the negatives (i.e., curriculum learning) could lead to even better results. In this vein, we offer the experiment shown in Figure 5, which examines the impact of training with negatives of progressively increasing difficulty. While this initial experiment suggests some benefit from curating the curriculum of hard negatives, we note that it was run after the release of Arctic embed, and we did not use this curriculum approach when training our published models.
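A minimal sketch of the easy-to-hard ordering we experimented with might look like the following; scoring each example by the mean relevance of its mined negatives is an illustrative assumption rather than our exact criterion.

    def curriculum_order(examples):
        # Each example is a dict with a "negative_scores" list of mined-negative relevance scores.
        def hardness(example):
            scores = example["negative_scores"]
            return sum(scores) / len(scores)   # harder negatives -> higher mean relevance
        return sorted(examples, key=hardness)  # present easier examples first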

4 Training Recipe

Variant Pretrain Batch Pretrain LR Finetune Batch Finetune LR
xs 24,576 6e-4 768 4e-5
s 32,768 5e-4 1,024 4e-5
m 16,384 2e-4 512 1e-5
m-long 12,288 1e-4 512 1e-5
l 12,480 1e-4 512 9e-6
Table 3: Learning rate and batch size for both rounds of training.

4.1 Model Initialization

We begin with a pretrained language model. Where permissively licensed base models pre-trained for information retrieval are available for a given model size (such as https://huggingface.co/intfloat/e5-large-unsupervised), we prefer these weights over general-purpose pretrained ones. Our ablation studies in Sections 7.1 and 7.3 showed mixed results and suggest the effect of this design choice on final performance may be weak relative to the other effects studied. However, as Figure 6 shows, starting from a more thoroughly trained base model, such as e5-base-unsupervised, had an apparent effect on sample efficiency and convergence speed, and this speedup was notably helpful for faster experimentation during model development.

Figure 6: A granular look at our starting weight ablation study from Section 7.1. We plotted the rolling average nDCG@10 score throughout training, observing that the run using pre-trained E5 model weights (red) converged much more quickly than the run using general-purpose BERT weights (blue). The evaluation dataset is a “lite BEIR” dataset based on NQ – details in Section B.1.

4.2 Large Scale Contrastive Pretraining With In-Batch Negatives

In the first round of contrastive training, we aim for large scale in both batch size and total dataset size. We train on our pretraining dataset with the InfoNCE contrastive loss using in-batch negatives: for each query, all documents associated with other queries in the minibatch are treated as negative examples. GPU parallelism, activation checkpointing, and a truncated sequence length were instrumental in achieving large batch sizes.
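A minimal PyTorch sketch of this in-batch negative loss is shown below: each query’s positive document sits on the diagonal of the batch similarity matrix, and every other document in the batch acts as a negative. The temperature value is an illustrative assumption.

    import torch
    import torch.nn.functional as F

    def in_batch_infonce(query_embs, doc_embs, temperature=0.02):
        # query_embs, doc_embs: (batch, dim), L2-normalized; row i of each forms a relevant pair.
        logits = query_embs @ doc_embs.T / temperature   # (batch, batch) similarity matrix
        targets = torch.arange(query_embs.size(0))       # positives lie on the diagonal
        return F.cross_entropy(logits, targets)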

We train for one epoch (some sizes of Arctic embed used early checkpoints from before one full epoch of pretraining, though this was done for expediency, and we did not find evidence that it improved performance) using the AdamW optimizer, adjusting only the learning rate while leaving all other parameters at their PyTorch default values. We perform a linear learning rate warmup for several hundred steps, then a linear decay to 10% of the original learning rate over the remainder of training. As evidenced by the example shown in Figure 10, we observed that performance could be sensitive to the learning rate and its schedule. Batch sizes and learning rates for each model size are given in Table 3.

4.3 Longer Truncation Length

Much of our pretraining data included documents significantly longer than 128 tokens. We used a document sequence length of 256 in large-scale contrastive training, in contrast to the 128-token truncation length used in GTE and BGE. We truncated query sequence length to 32, consistent with BGE’s source code (https://github.com/FlagOpen/FlagEmbedding/blob/53cfac4a50ac0e023b2f8d19b10667c9c210fa41/FlagEmbedding/baai_general_embedding/finetune/arguments.py). Our ablation study in Section 7 suggests this longer truncation length led to a substantial improvement in retrieval performance.

4.4 Source Stratification

We fill each batch with data from a single source during pretraining, a technique that yielded accuracy gains in prior work Nussbaum et al. (2024). Our ablation study in Section 7.1 indicates this led to a dramatic improvement in model quality (see Table 5).
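A minimal sketch of such a source-stratified batch sampler appears below; the details (full batches only, shuffling batch order across sources) are assumptions about one reasonable implementation, not a description of our data loader.

    import random

    def stratified_batches(examples_by_source, batch_size, seed=0):
        # Build batches so each one contains examples from exactly one data source.
        rng = random.Random(seed)
        batches = []
        for examples in examples_by_source.values():
            examples = list(examples)
            rng.shuffle(examples)
            for start in range(0, len(examples) - batch_size + 1, batch_size):
                batches.append(examples[start:start + batch_size])  # single-source batch
        rng.shuffle(batches)  # interleave sources across the training run
        return batches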

4.5 Quality-Focused Contrastive Training With Curated Negatives

After large-scale training, we perform a second round of training leveraging our fine-tuning dataset, which contains explicitly labeled negative examples. We use no learning rate warmup but apply the same linear learning rate decay schedule as in the pretraining stage. We truncate sequence lengths to 512 for queries and documents for all models, including the long-context variant m-long. For each query in a batch, we include one positive document and ten hard negative documents. Batch sizes (number of queries) and learning rates for each model size are given in Table 3.

4.6 Disabling In-Batch Negative Loss

Based on some early fine-tuning runs, we found that disabling in-batch negative loss did not measurably degrade performance. We stopped using in-batch negatives during fine-tuning (this made tuning easier, especially since the interaction between batch size and in-batch loss is not straightforward).

5 Efficiency

To maximize experimental throughput, iteration speed, and the feasible batch size, we took pains to ensure our training setup was as efficient as possible for our given computational budget. We carefully optimized the net efficiency of training and iteration, assuming a single training node with 8 NVIDIA H100 GPUs. We achieved high efficiency by carefully implementing a custom data loader and writing our training loop in plain PyTorch to leverage several “tricks” we detail in Section B.2. Additionally, we identified and eliminated careless performance bottlenecks through performance benchmarking, keeping a watchful eye on both throughput and GPU utilization.

Further discussions about methods we found helpful for efficient experimentation and training can be found in the Appendix, including a discussion of granular evaluation during training in Section B.1.

6 Experimental Results

To quantify our retrieval quality, we evaluate model performance on the Retrieval portion of the MTEB benchmark Muennighoff et al. (2023). Summary results of the MTEB experiments are shown in Figure 1, and a complete tabulation by dataset is given in Appendix E. To quantify the performance of our long-context model, we leverage the LoCo benchmark Saad-Falcon et al. (2024), the results of which are given immediately below.

6.1 Long Context Performance

Model Seq Len SummScreenFD Gov. Report QMSUM QASPER Title QASPER Abs. Avg
arctic-embed-m-long 2048 63.2 93.8 45.8 77.3 95.8 75.2
arctic-embed-m-long 4096 81.8 96.0 39.7 85.9 99.0 80.5
arctic-embed-m-long 8192 86.6 96.5 31.7 81.6 99.7 79.2
jina-base-v2 2048 87.2 97.7 35.1 95.3 99.7 83.0
jina-base-v2 8192 93.3 98.6 40.8 95.1 99.3 85.5
nomic-embed-text-v1 2048 86.1 96.9 47.8 96.1 99.7 85.3
nomic-embed-text-v1 4096 89.0 97.4 45.7 95.8 99.9 85.6
nomic-embed-text-v1 8192 90.9 97.8 44.2 94.9 99.9 85.5
e5-mistral 4096 95.9 98.3 46.8 98.4 99.8 87.8
Table 4: Snowflake-arctic-embed-m-long nDCG@10 scores on the LoCo benchmark. Non-Arctic scores taken from Nussbaum et al. (2024) without an attempt to reproduce.

For our initial Arctic embed release, we did not put any special effort into adjusting our training recipe for long-context support. Instead, our m-long variant was trained only on short-sequence data (it was pretrained with sequences truncated to 256 tokens and finetuned with sequences truncated to 512 tokens). Nonetheless, even on the specialized LoCo long-context benchmark (Table 4), performance only tends to lag slightly behind models trained end-to-end specifically with long context in mind, e.g., nomic-embed-text-v1. While these LoCo results suggest m-long may not be the model of choice for long sequences, its strong MTEB Retrieval scores suggest it may be a good pick for datasets containing a mix of long and short sequences.

This surprisingly reasonable performance may be largely thanks to m-long’s base model, nomic-embed-unsupervised, having been trained on long-sequence retrieval, but unfortunately we did not have time to run an ablation study to quantify the impact of this base model.

7 Ablation Studies

We conducted several ablation studies to test some of the hypotheses stated in Table 2 regarding the causes of Arctic embed’s higher MTEB Retrieval scores relative to other similar models. Average scores are given in the following subsections, with full scores in Appendix E.

7.1 Pre-training Ablations

Figure 7: A granular look at our source stratification ablation study (also see Table 5) showing rolling average nDCG@10 on the SciDocs dataset. A large batch size with source stratification (light blue) delivers the highest performance. Although the random-source large-batch run (purple) drives performance up sharply at the start of training, without source stratification performance quickly plateaus, falling behind the source-stratified small-batch run (dark blue) despite using 4× the data and compute.
Run Base Model Data Stratify Source Batch Size Seq. Length Score
A bert-base-uncased Snowflake Yes 16,384 256 46.97
B e5-unsupervised-base Snowflake Yes 16,384 256 46.96
C bert-base-uncased Nomic Yes 16,384 256 46.55
D bert-base-uncased Snowflake No 16,384 256 43.74
E bert-base-uncased Snowflake Yes 4,096 256 45.36
F bert-base-uncased Snowflake Yes 16,384 128 45.53
Table 5: Large-scale training ablation study. Varied treatment bolded. The score is nDCG@10 on the MTEB Retrieval benchmark.

We probed the effects of batch size, sequence length, base model, and training data in a series of ablations, with the resulting MTEB Retrieval scores tabulated in Table 5. In each case, we trained for 20k steps using a linear learning rate decay from 2e-4 to 2e-5 after a 300-step linear warmup from 0. (This ablation setup differs slightly from our published models’ configuration: the warmup was 300 steps instead of 100, gradient clipping was enabled, and 20k steps often slightly exceeded one epoch through the data used in our published models, where one epoch was often around 19k steps. In some cases, we evaluated a one-epoch checkpoint, around 19k steps instead of 20k, to mitigate a data-loading correctness issue discovered post-training in the beyond-one-epoch regime for this dataset.)

Overall, the ablation study results support our hypotheses about data sourcing, longer sequence length, and source stratification improving model performance. In contrast, the choice of initializing from a pre-trained retrieval model did not significantly impact the MTEB Retrieval score after pretraining. We also notice the interesting curriculum-learning-like pattern of source stratification mattering more later in training than other factors like batch size (see Figure 7).

7.2 Fine-tuning Ablations

As discussed in Section 3.6, our tunable negative mining approach uses a threshold to filter out too-hard negatives. We perform an ablation study on several threshold values to demonstrate the importance of the threshold parameter. The results shown in Figure 8 indicate that too-low and too-high maximum relevancy thresholds (too-hard and too-easy negatives) lead to significantly worse performance.

Figure 8: Ablation study of different relevance thresholds for hard negative mining.

7.3 End-to-end ablations

Figure 9: Comparison of various pretrained models under identical fine-tuning. The performance gap associated with different pretraining datasets appears rather quickly, while the gap associated with different base model weights does not even appear for this dataset (dataset details in Section B.1).
Starting Model Pretrain Data Score After Pretrain Final Score
bert-base Snow 46.97 53.92
e5-unsup. Snow 46.96 54.67
bert-base Nomic 46.55 52.23
Table 6: Final two-stage training scores for our published m model and two ablations.

To thoroughly study the effect of training data on the final score, we extended a subset of our pretraining ablation study through the fine-tuning step. We applied a fine-tuning step similar to the one used for our published arctic-embed-m model to configurations A, B, and C from Table 5 (varying data and base model). The pretraining and fine-tuning trajectories are shown in Figure 9, with final MTEB Retrieval scores in Table 6. Although the performance gap between models pretrained on Snowflake and Nomic data was relatively modest after pretraining, the gap widens substantially with fine-tuning, despite the fine-tuning recipe being identical. We also see a slight improvement in the final score for the configuration using e5-unsupervised-base. We note that our fine-tuning recipe was tuned for an e5-unsupervised-base model pretrained on our data, which may have affected these results.

8 Conclusion and Future Work

By creating the suite of Arctic text embedding models, we sought to better understand how to optimize the training recipe for high-quality text embedding models. Our exploration found that dataset-stratified mini-batches and tuned hard negative mining were crucial ingredients for training a model for more effective retrieval.

In the future, we seek to continue our experimentation with improved curriculum learning and better methods of source stratification. Additionally, we aim to train models that are more robust to embedding compression approaches such as binarization and quantization.

References

Appendix A Dataset Creation Algorithm Details

Algorithms 1 and 2 provide the details of the algorithms used to create our high-quality fine-tuning dataset (see Sections 3.5 and 3.6).

Algorithm 1: Tunable Negative Mining
Require:
  • P, a dataset of m query-document pairs, P = {(q_1, d_1), (q_2, d_2), …, (q_m, d_m)}.
  • D, a corpus of n documents, D = {d_1, d_2, …, d_n}. One option is to reuse the documents from P.
  • r, a semantic relevance scoring function mapping two pieces of text t_1 and t_2 to a real number, i.e. score = r(t_1, t_2). It should effectively score the semantic relevance of query-query, document-document, and query-document pairs. Example implementation: use a preexisting text embedding model to embed both texts, then compute a vector similarity score via cosine similarity.
Parameters:
  • R_max, a real-valued maximum relevance threshold. This threshold defines the degree of relevance below which two items can be considered “irrelevant” for the sake of training. This cutoff helps us attenuate label noise and avoid trying to teach the text embedding model to treat relevant documents as irrelevant.
  • R_min, a real-valued minimum relevance threshold. This threshold defines the degree of relevance below which two items are considered “too obviously irrelevant” for the sake of training. This cutoff helps us keep all negative examples difficult enough for the model.
  • k_neg, the maximum desired number of negatives to mine per query-document pair.
Procedure:
1. Initialize dataset Z = {}.
2. For each query-document pair (q_i, d_i) ∈ P do:
   (a) Compute relevance scores s = [s_1, …, s_n] = [r(q_i, d_1), r(q_i, d_2), …, r(q_i, d_n)].
   (b) Drop from consideration all documents with relevance scores outside the threshold range, i.e., define s′ = [s_j : R_min ≤ s_j ≤ R_max].
   (c) Use the remaining scores to determine the most relevant documents D_topk = {d_{i,1}, …, d_{i,k}} = topk_{s′}(D).
   (d) Update Z with the query-document-documents example (q_i, d_i, {d_{i,1}, …, d_{i,k}}), i.e. Z ← Z ∪ {(q_i, d_i, {d_{i,1}, …, d_{i,k}})}.
Variations: In addition to positive-query-to-negative-document relevance r(q_i, d_j), positive-document-to-negative-document relevance r(d_i, d_j) can also be used as a signal for mining hard negatives.
Algorithm 2: Synthetic Data Generation
Require:
  • D, a corpus of n documents (same as Algorithm 1).
  • g, a synthetic query generation function mapping one positive (relevant) document and k examples of negative (irrelevant) documents to a synthetic query, i.e. q = g(d_pos, d_neg^(1), d_neg^(2), …, d_neg^(k)). Example implementation: prompt an instruction-tuned LLM with the positive and negative example documents.
  • r, a semantic relevance scoring function (same as Algorithm 1).
Procedure:
1. Initialize empty paired dataset P = {}.
2. For each document d_i ∈ D do:
   (a) Identify the top k documents with the highest semantic relevance:
       i. Compute relevance scores s = [r(d_i, d_1), r(d_i, d_2), …, r(d_i, d_n)].
       ii. Use the scores to determine the most relevant documents D_topk = {d_{i,1}, …, d_{i,k}} = topk_s(D).
   (b) Generate a synthetic query q_i = g(d_i, d_{i,1}, …, d_{i,k}).
   (c) Update P with the query-document pair (q_i, d_i), i.e. P ← P ∪ {(q_i, d_i)}.
3. Apply Algorithm 1 to mine negative documents from the paired data P and corpus D using relevance function r.
Variations:
  • Grounding queries: in addition to a positive document and a sequence of negative documents, the query generation function can also be constructed to accept an example query to ground the generation. In this case, paired query-document data becomes required, but higher-quality outputs may be possible when high-quality grounding queries are available. In this setting, query-document relevance can also be used as a signal for selecting negative examples.
  • Randomization: rather than taking the top k most relevant documents, one can select the top k′ > k documents and randomly sample k of these. This approach (and similar randomization approaches) can generate multiple queries from a single document (or query-document pair if the above variation is used).
  • Relevance thresholding: similar to the machinery of Algorithm 1, one can adjust the negative examples used in query generation by discarding examples deemed “too relevant” or “too irrelevant” per thresholds on the relevance score.
Algorithm 3: Synthetic Data Generation Prompt
You are a search quality rater tasked with evaluating the effectiveness of a search engine. You aim to generate a plausible query that retrieves a specific document when executed on a high-performing search engine. The query should be relevant to the document’s content and make sense without access to the relevant document.

Sample Document:
– BEGIN SAMPLE DOCUMENT –
SAMPLE_DOC
– END SAMPLE DOCUMENT –

Sample Generated Query:
– BEGIN SAMPLE GENERATED QUERY –
SAMPLE_QUERY
– END SAMPLE GENERATED QUERY –

Document Details:
– BEGIN DOCUMENT TEXT –
DOCUMENT_TEXT
– END DOCUMENT TEXT –

Irrelevant Documents:
Irrelevant Document 1: WRONG_1
Irrelevant Document 2: WRONG_2
Irrelevant Document 3: WRONG_3
Irrelevant Document 4: WRONG_4

Instructions: Read the document text and consider potential user actions. Reflect on the document’s content and utility. Review the irrelevant documents to understand query nuances. Create a query (Q) that retrieves the target document but excludes irrelevant ones.

Considerations: Does the query fully address the user’s intent? Does the query uniquely identify the target document?

Comprehensive Explanation: Provide a detailed explanation of why the generated query effectively retrieves the target document and excludes irrelevant ones. Ensure the explanation precedes the query for clarity and evaluation purposes.

Respond only with the JSON object comprising keys E and Q.

Appendix B Additional Efficiency Details

Table 7 quantifies our training efficiency in terms of throughput.

Training Stage Batches Docs/Batch Max Doc Length Elapsed Docs/Sec
Large Scale In-Batch Negatives 18,798 16,384 256 17h3m 5,018
Smaller Scale Hard Negatives 7,845 5,632 512 7h40m 1,601
Table 7: Training efficiency for arctic-embed-m on 8 NVIDIA H100 GPUs. The in-training evaluation is included in the runtime.

B.1 Fast Feedback With “Lite” Datasets

We found it valuable to conduct information retrieval evaluation at regular intervals throughout the training process, not just at the end, as this uncovered significant trends not captured by end-of-training evaluation (see Figure 7, for example). To enable these evaluations, we constructed “lite” versions of several large BEIR datasets by starting with a sample of queries (e.g., a few hundred) and combining their labeled-as-relevant documents with the most relevant documents retrieved by a preexisting text embedding model (we often used 100 documents per query). The result is a corpus far smaller than the original yet still challenging for the sampled queries. We found that these “lite” datasets offered a cheap way to anticipate full-scale performance trends at a small fraction of the compute cost required by the full datasets.
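A minimal sketch of this “lite” dataset construction follows; retrieve_top_docs stands in for retrieval with a preexisting embedding model, and the query and document counts mirror the numbers mentioned above.

    import random

    def build_lite_dataset(queries, qrels, retrieve_top_docs,
                           num_queries=300, docs_per_query=100, seed=0):
        # queries: {query_id: text}, qrels: {query_id: set of relevant doc ids}.
        rng = random.Random(seed)
        sampled = rng.sample(sorted(queries), k=min(num_queries, len(queries)))
        corpus_ids = set()
        for qid in sampled:
            corpus_ids |= qrels.get(qid, set())                                   # keep labeled positives
            corpus_ids |= set(retrieve_top_docs(queries[qid], k=docs_per_query))  # plus hard distractors
        lite_queries = {qid: queries[qid] for qid in sampled}
        lite_qrels = {qid: qrels.get(qid, set()) for qid in sampled}
        return lite_queries, lite_qrels, corpus_ids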

Through a combination of already-small BEIR datasets and our new “lite BEIR” datasets, we could evaluate retrieval performance on a diverse set of domains throughout training. We achieved excellent throughput by implementing evaluation in the same Distributed Data-Parallel paradigm as our training. We were able to embed and score about five datasets of this size in roughly 30 seconds, netting us useful nDCG@10 scores as often as every 100 steps of training with only modest runtime overhead (Figures 6, 7, and 10, which are screenshots of our in-training evaluation, demonstrate just how granular this evaluation frequency is in practice).

B.2 Efficiency Tricks


Appendix C List of Data Quality Filter Heuristics

  • Language: Only include documents primarily identified as English using a fastText language classifier confidence threshold.

  • Document Length: The number of normalized words in each document should be between 10 and 10,000. This filter removes very short or extremely long documents.

  • Word Length: After normalization, the mean length of words should be between 3 and 10 characters. This filter helps remove documents with excessively short or long words.

  • Symbol Density: The document’s ratio of symbols to words should be less than 0.1 (10%). This filter removes documents with an excessive number of symbols or special characters.

  • Ellipsis Line: The fraction of lines that end with an ellipsis (…) should be less than 0.3 (30%). This filter removes documents with an excessive number of lines ending with ellipses.

  • Non-Alphabetical Word: The fraction of words that contain no alphabetical characters should be less than 0.2 (20%). This filter removes documents with excessive non-alphabetical words (e.g., numeric strings, symbols).

  • Perplexity: The perplexity score (from https://github.com/kpu/kenlm) should be less than 10,000. This filter removes documents that are less likely to be understandable or meaningful.

  • N-Gram Duplication: Limit the fraction of characters in duplicate n-grams (sequences of n words) for n ranging from 2 to 10. These filters help remove documents with excessive repetition or duplication of phrases.

  • Stop Word: The document should contain at least one stop word (common words like “the”, “and”, “is”). This filter helps remove documents without meaningful content.

  • Bullet Point Line: The density of lines starting with bullet points should be less than 0.9 (90%). This filter removes documents that are mostly composed of bullet points or lists.

  • Blacklist Domain: The domain of the document URL must not appear in the blacklist from https://dsi.ut-capitole.fr/blacklists/index_en.php. This filter helps remove documents from known low-quality, adult, or spam domains.

  • Short Line: The fraction of lines with fewer than five words should be less than 0.1 (10%). This filter removes documents with an excessive number of very short lines.

  • Numeric Line: The fraction of lines containing only numeric characters should be less than 0.05 (5%). This filter removes documents with excessive lines consisting solely of numbers.

  • Uppercase Line: The fraction of lines with more than 80% uppercase letters should be less than 0.05 (5%). This filter removes documents with excessive lines in all uppercase letters.
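As an illustration, a minimal sketch combining a few of the heuristics above (document length, mean word length, symbol density, and the stop-word check) might look like the following; the threshold values mirror the list, but the implementation details are assumptions.

    STOP_WORDS = {"the", "and", "is", "a", "of", "to", "in"}
    SYMBOLS = set("#@{}[]<>|\\^~")

    def passes_quality_filters(text):
        words = text.split()
        if not (10 <= len(words) <= 10_000):                # document length filter
            return False
        mean_word_length = sum(len(w) for w in words) / len(words)
        if not (3 <= mean_word_length <= 10):               # mean word length filter
            return False
        symbol_count = sum(ch in SYMBOLS for ch in text)
        if symbol_count / len(words) >= 0.1:                # symbol density filter
            return False
        # Stop word filter: require at least one common English stop word.
        return any(w.lower().strip(".,!?") in STOP_WORDS for w in words)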

Appendix D Anecdata On The Importance Of Tuning

Throughout the development of Arctic embed, we found the time-honored technique of guess-and-check tuning to be critical for ensuring a quality final model. By investing in efficient training and high-granularity in-training evaluation (see Section 5), we were able to carry out dozens of experiments, ranging from one-off wild guesses (“YOLO runs”) to parameter sweeps spanning ten or more trials, in the background of building our in-house datasets and finishing our training code implementation.

(a) nDCG@10 score averaged across several datasets
(b) Loss curves
Figure 10: Example of surprising sensitivity to learning rate schedule. The yellow lines track a 20k step training run, while the green lines show the same data and hyperparameters extended to an 80k step run, with the only difference in the first steps being the slower linear learning rate decay in the longer run. Although learning rates are still similar at 6k steps (approximately 0.00015 and 0.00019 for yellow and green, respectively), the learning trajectories diverge sharply around this point, with the 20k schedule learning faster both in terms of downstream IR performance (Figure 10a) and in-sample contrastive loss (Figure 10b).
Stratify Source Steps Score
Yes 20k 45.36
Yes 80k 46.13
No 20k 42.42
No 80k 42.79
Table 8: Extended study of 4k batch size runs demonstrating better performance at fewer steps. A one-epoch (~75k step) checkpoint is used for the 80k stratified run due to a data loading bug.
Figure 11: Hyperparameter sweep for different batch sizes and learning rates.

We were surprised quite often in our tuning. Sometimes, we were surprised by an unexpected insensitivity to presumably important hyperparameters, as was the case when we experimented with various batch size settings and learning rates for fine-tuning, as shown in Figure 11. Other times, we were surprised by unexpected sensitivity, like the unforeseen differences in trajectory between shorter and longer training runs (see the comparisons in Table 8 and Figure 10, for example, where training for more steps with our linear learning rate decay schedule leads to much worse performance early in training).

Although we ran many informal experiments while developing Arctic embed, we believe there is still much to be learned about the peculiarities of tuning text embedding training.

Appendix E Full MTEB Score Breakdown

Dataset | baseline | using e5-base-unsupervised | using nomic data | using sequence length 128 | 4k batch 80k step stratified | 4k batch 80k step unstratified | 4k batch 20k step stratified | 4k batch 20k step unstratified | without source stratification | baseline, finetuned | using e5-base-unsupervised, finetuned | using nomic data, finetuned
ArguAna 45.14 43.95 53.66 39.69 44.13 39.70 44.75 42.18 45.65 58.77 55.44 51.78
CQADupstackRetrieval 42.26 42.24 41.57 40.80 40.71 37.36 40.15 36.90 39.05 43.24 44.11 42.75
ClimateFEVER 20.35 23.75 22.09 21.33 20.73 19.51 18.47 19.78 18.79 35.55 38.82 32.57
DBPedia 38.97 37.83 38.35 36.92 38.77 37.58 39.27 35.77 37.02 44.68 45.81 43.07
FEVER 71.65 71.39 69.59 64.42 70.61 71.04 72.55 68.34 68.82 87.55 88.46 86.78
FiQA2018 42.09 45.20 37.40 40.46 39.70 35.30 38.73 34.92 37.67 40.44 42.88 36.79
HotpotQA 57.77 53.97 60.52 56.69 59.10 47.22 60.54 47.22 46.27 71.77 73.03 69.84
MSMARCO 33.85 33.75 34.35 33.77 33.12 31.46 32.79 31.05 32.40 41.73 41.89 40.26
NFCorpus 33.83 36.28 36.25 32.83 34.28 32.71 32.81 31.90 32.58 34.63 36.16 34.98
NQ 46.92 47.07 48.46 46.68 45.81 37.30 44.87 37.55 38.85 60.37 61.82 58.48
QuoraRetrieval 87.20 87.45 88.13 87.30 86.99 86.25 86.51 86.10 86.58 87.32 87.60 87.94
SCIDOCS 20.39 21.54 21.62 20.24 20.02 17.88 19.09 18.30 18.58 20.20 21.09 21.00
SciFact 69.42 70.97 74.97 68.90 68.68 65.99 66.89 66.88 68.36 70.58 73.34 75.16
TRECCOVID 71.32 66.99 50.91 70.96 66.82 62.14 61.23 59.14 63.51 79.98 80.27 74.70
Touche2020 23.38 21.97 20.36 22.00 22.42 20.34 21.76 20.22 22.03 31.95 29.38 27.40
Overall Average 46.97 46.96 46.55 45.53 46.13 42.79 45.36 42.42 43.74 53.92 54.67 52.23
Table 9: Full per-dataset nDCG@10 MTEB Retrieval scores for each ablation variant.