Multi-View Scientific Citation Prediction
Main Reference Checker Notebook
This notebook builds a binary classifier for inter-paper citation prediction. For each candidate pair (paper, referenced_paper), the task is to decide whether the source paper truly cites the candidate reference. The workflow integrates three complementary evidence layers:
- Global semantic evidence from full-document embeddings.
- Local semantic evidence from chunk-to-chunk similarity statistics.
- Bibliometric and metadata evidence from publication year, author overlap, concepts, titles, and citation counts.
The central hypothesis is that citation links are not explained by topical similarity alone. A scientifically plausible citation usually requires simultaneous agreement in semantic content, local contextual alignment, and bibliographic feasibility. The downstream model therefore treats citation prediction as a multi-view classification problem rather than a pure nearest-neighbor retrieval task.
Evaluation Principle
The competition metric is Matthews Correlation Coefficient (MCC). MCC is preferable to raw accuracy in imbalanced binary problems because it incorporates all four entries of the confusion matrix and penalizes trivial majority-class solutions.
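To make the metric concrete, here is a minimal plain-Python sketch of MCC computed directly from the four confusion-matrix counts (the function name and zero-denominator fallback are conventional choices, not the notebook's scoring code). It shows why a trivial majority-class predictor scores zero even at high accuracy.

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal is zero (the conventional fallback)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# A majority-class predictor on a 90/10 imbalanced set: 90% accuracy, MCC 0.
print(mcc(tp=0, tn=90, fp=0, fn=10))  # 0.0
```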
1. Data Ingestion and Corpus Construction
We first assemble the minimal corpus required by the train and test pairs. Instead of reading every paper repeatedly during feature construction, the notebook creates a union of all paper identifiers that appear in either endpoint of a candidate pair, then materializes each full text once into the paper_contents dictionary.
This design reduces disk I/O and ensures that all downstream feature functions operate on a consistent in-memory text store. Missing files are mapped to empty strings, which is a conservative fallback: the pair remains usable through metadata-derived signals even when the raw text is unavailable.
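The ingestion step can be sketched as follows. The helper name `build_corpus` and the `<paper_id>.txt` file-naming convention are illustrative assumptions; the point is the union-then-materialize-once pattern with an empty-string fallback.

```python
from pathlib import Path

def build_corpus(pairs, text_dir):
    """Materialize each unique paper's full text exactly once.

    `pairs` is an iterable of (paper, referenced_paper) id tuples.
    Missing files map to "" so the pair stays usable via metadata features."""
    paper_ids = {pid for pair in pairs for pid in pair}
    paper_contents = {}
    for pid in sorted(paper_ids):
        path = Path(text_dir) / f"{pid}.txt"  # assumed naming convention
        paper_contents[pid] = path.read_text(encoding="utf-8") if path.exists() else ""
    return paper_contents
```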
2. Global Document Representation
Each paper is embedded at document scale by concatenating its title with the full text and encoding the result using a transformer model. The helper function is model-agnostic, but the execution stage below uses allenai/specter, which is specifically designed for scientific documents and is therefore a strong prior for citation-oriented similarity.
For a paper pair (p, r), the global semantic signal is later summarized by the cosine similarity of their document embeddings, $\cos(e_p, e_r) = \dfrac{e_p \cdot e_r}{\lVert e_p \rVert \, \lVert e_r \rVert}$.
A single dense vector per paper captures broad topical relatedness and provides the first approximation of whether two papers belong to the same scientific conversation.
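A minimal sketch of that similarity summary, assuming the cached embeddings are plain NumPy vectors (the function name is illustrative):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two document embeddings."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Illustrative low-dimensional vectors; the real embeddings are 768-d.
a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 1.0, 0.0])
print(round(cosine_similarity(a, b), 3))  # 0.5
```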
3. Local Context Modeling by Sliding Windows
Full-document embeddings are useful, but they can miss narrow citation-relevant evidence such as a short methodological borrowing, a shared experimental setup, or a localized theoretical claim. To recover that finer structure, the notebook defines overlapping windows over the paper text.
The overlap is intentional: it reduces boundary artifacts and increases the probability that important phrases survive intact in at least one chunk. The helper functions below establish the conceptual machinery for passage-level similarity before the pipeline switches to a more efficient precomputed chunk encoder.
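The windowing idea can be sketched as below, assuming simple whitespace tokenization (the real pipeline's tokenizer may differ):

```python
def sliding_windows(text: str, window: int = 25, stride: int = 10):
    """Split whitespace tokens into overlapping windows.

    Overlap reduces boundary artifacts: a phrase cut at one window
    edge is usually intact in the next window."""
    tokens = text.split()
    if not tokens:
        return []
    if len(tokens) <= window:
        return [" ".join(tokens)]
    return [" ".join(tokens[i:i + window])
            for i in range(0, len(tokens) - window + 1, stride)]

# 60 tokens with window=25, stride=10 -> windows starting at 0, 10, 20, 30.
print(len(sliding_windows("w " * 60, window=25, stride=10)))  # 4
```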
4. Pairwise Document-Level Feature Engineering
The next feature block combines semantic similarity with bibliometric priors. For each candidate citation pair, the notebook derives:
- full-document cosine similarity,
- temporal compatibility (year_diff, can_cite, same_year),
- citation-count asymmetry,
- author and concept overlap,
- document-type agreement,
- lexical title similarity,
- a weak heuristic indicating whether the referenced title appears in the source text.
Scientifically, these variables represent different mechanisms behind real citation behavior: content relevance, chronology, scholarly community overlap, conceptual proximity, and explicit textual mention. The objective is to move from raw embeddings into interpretable pairwise evidence.
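The metadata portion of this feature block can be sketched as follows. The record schema (`year`, `authors`, `concepts`, `cited_by_count`, `type`) and the Jaccard choice for the overlap scores are illustrative assumptions, not the notebook's exact implementation.

```python
def pair_features(meta_p: dict, meta_r: dict) -> dict:
    """Metadata-derived signals for one candidate pair (p cites r?)."""
    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 0.0

    year_diff = meta_p["year"] - meta_r["year"]
    return {
        "year_diff": year_diff,
        "can_cite": int(year_diff >= 0),  # a paper cannot cite the future
        "same_year": int(year_diff == 0),
        "cited_by_count_ratio": meta_r["cited_by_count"] / max(meta_p["cited_by_count"], 1),
        "author_overlap": jaccard(meta_p["authors"], meta_r["authors"]),
        "concept_overlap": jaccard(meta_p["concepts"], meta_r["concepts"]),
        "same_type": int(meta_p["type"] == meta_r["type"]),
    }
```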
4.1 Embedding Generation
All unique papers are embedded once and then reused across every train and test pair. This is computationally critical because transformer inference is expensive, while the number of candidate pairs can be much larger than the number of unique papers.
Generating paper embeddings... Loading model: allenai/specter
Using device: cuda
Generated embeddings for 4354 papers with dimension 768
4.2 Materializing the Document-Level Feature Tables
Once paper embeddings are cached, the row-wise feature constructor is applied to each candidate edge. The result is a tabular train/test representation where every row corresponds to one potential citation link and every column captures a global semantic or metadata-derived signal.
Computing document-level features for train...
Computing document-level features for test...
5. Efficient Chunk-Level Similarity Features
The production chunk pipeline uses all-MiniLM-L6-v2 to encode short text windows. This encoder is much lighter than a full scientific transformer and is therefore suitable for large-scale chunk comparisons. The selected chunk scheme, selected = (25, 10), corresponds to a 25-token window with a stride of 10 tokens, which balances local context preservation against computational cost.
Chunk embeddings are precomputed once per paper and serialized to disk. Precomputation converts an otherwise expensive nested inference problem into a reusable lookup table, which is essential when the same paper appears in many candidate citation pairs.
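A sketch of the precompute-and-serialize step. `precompute_chunk_embeddings` and `encode_fn` are hypothetical stand-ins: in the real pipeline `encode_fn` would wrap the all-MiniLM-L6-v2 encoder, while here it is any callable mapping a list of chunk strings to an embedding matrix.

```python
import pickle
from pathlib import Path

def precompute_chunk_embeddings(paper_contents, encode_fn, window=25, stride=10,
                                out_path="chunk_embeddings_25_10.pkl"):
    """Encode every paper's chunks once and serialize the lookup table."""
    cache = {}
    for pid, text in paper_contents.items():
        tokens = text.split()
        if not tokens:
            cache[pid] = None  # conservative fallback for empty texts
            continue
        chunks = [" ".join(tokens[i:i + window])
                  for i in range(0, max(len(tokens) - window, 0) + 1, stride)]
        cache[pid] = encode_fn(chunks)
    Path(out_path).write_bytes(pickle.dumps(cache))  # reusable on later runs
    return cache
```

Because the table is keyed by paper id, a paper that appears in thousands of candidate pairs is encoded exactly once.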
Saved chunk embeddings for scheme (25,10) to chunk_embeddings_25_10.pkl
Loaded GPU chunk embeddings for schemes: [(25, 10)]
5.1 Streaming Chunk Similarity Statistics
Given the chunk embedding matrices $A$ and $B$ of a candidate pair, the notebook avoids storing the full similarity matrix in memory. Instead, it computes blockwise similarities on GPU and summarizes the distribution through:
- maximum similarity,
- mean similarity,
- standard deviation,
- fraction of similarities above a high threshold,
- aggregate summaries across chunking schemes.
This matters scientifically because a true citation may appear either as one extremely sharp local match or as a broader pattern of moderate alignment across multiple passages.
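The same streaming statistics can be sketched on CPU with NumPy (the notebook computes them blockwise on GPU with torch tensors). Rows of $A$ and $B$ are assumed L2-normalized, so dot products equal cosine similarities; the function name and threshold default are illustrative.

```python
import numpy as np

def chunk_sim_stats(A: np.ndarray, B: np.ndarray, block: int = 1024,
                    threshold: float = 0.8) -> dict:
    """Summarize the A x B similarity distribution without materializing
    the full matrix: process A in row blocks and keep running moments."""
    n_total, s_sum, s_sqsum, s_max, n_above = 0, 0.0, 0.0, -1.0, 0
    for i in range(0, A.shape[0], block):
        sims = A[i:i + block] @ B.T            # one block of similarities
        n_total += sims.size
        s_sum += sims.sum()
        s_sqsum += (sims ** 2).sum()
        s_max = max(s_max, sims.max())
        n_above += int((sims > threshold).sum())
    mean = s_sum / n_total
    return {"max": float(s_max), "mean": float(mean),
            "std": float(np.sqrt(max(s_sqsum / n_total - mean ** 2, 0.0))),
            "frac_above": n_above / n_total}
```

The running-moment formulation is what lets the block size be tuned to GPU memory without changing the resulting features.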
6. Multi-View Feature Fusion
The document-level and chunk-level feature tables are fused into a single supervised learning matrix, with every row carrying both views of one candidate pair. This fusion is the key modeling decision of the notebook: global semantics explain coarse relevance, local chunk statistics capture fine-grained alignment, and the combined representation gives the final classifier access to nonlinear interactions between the two views.
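A toy sketch of the fusion, with two one-row fragments standing in for the real tables (column names taken from the dtype listing below, values illustrative):

```python
import pandas as pd

# Hypothetical fragments of the two views, keyed by the candidate edge.
doc_feats = pd.DataFrame({"paper": ["p1"], "referenced_paper": ["p2"],
                          "text_similarity": [0.76]})
chunk_feats = pd.DataFrame({"paper": ["p1"], "referenced_paper": ["p2"],
                            "max_chunk_sim_25_10": [0.65]})

# Merge on the edge key so each row keeps both global and local evidence.
fused = doc_feats.merge(chunk_feats, on=["paper", "referenced_paper"])
print(fused.columns.tolist())
```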
text_similarity            float64
year_diff                    int64
can_cite                     int64
cited_by_count_ratio       float64
cited_by_count_ref           int64
cited_by_count_paper         int64
same_year                    int64
author_overlap             float64
concept_overlap            float64
same_type                    int64
title_similarity           float64
contains_citation_text       int64
paper                       object
referenced_paper            object
is_referenced                int64
max_chunk_sim_25_10        float64
mean_chunk_sim_25_10       float64
std_chunk_sim_25_10        float64
frac_above80_25_10         float64
avg_chunk_sim              float64
max_chunk_sim              float64
chunk_sim_variance         float64
dtype: object
(the same dtypes are printed for both the train and the test table)
7. Gradient-Boosted Tabular Modelling
After constructing the primary scientific features, the workflow shifts from representation building to supervised tabular learning. The objective here is not to learn new text encoders, but to integrate heterogeneous evidence through a robust nonlinear classifier. Gradient-boosted trees are well suited to this setting because they can absorb mixed feature scales, capture threshold effects, and model higher-order interactions without extensive preprocessing.
The next cells reload the fused feature tables, enrich them with additional categorical context, expand them with grouped statistics, prune the feature space, and fit the final XGBoost classifier used for submission.
| | text_similarity | year_diff | can_cite | cited_by_count_ratio | cited_by_count_ref | cited_by_count_paper | same_year | author_overlap | concept_overlap | same_type | ... | paper | referenced_paper | is_referenced | max_chunk_sim_25_10 | mean_chunk_sim_25_10 | std_chunk_sim_25_10 | frac_above80_25_10 | avg_chunk_sim | max_chunk_sim | chunk_sim_variance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.760211 | 3 | 1 | 6.960894 | 2492 | 357 | 0 | 0.0 | 0.250000 | 1 | ... | p2128 | p3728 | 0 | 0.653185 | 0.108310 | 0.092632 | 0.0 | 0.653185 | 0.653185 | 0.0 |
| 1 | 0.631514 | -24 | 0 | 1.097760 | 1078 | 981 | 0 | 0.0 | 0.000000 | 0 | ... | p0389 | p3811 | 0 | 0.595115 | 0.129211 | 0.090438 | 0.0 | 0.595115 | 0.595115 | 0.0 |
| 2 | 0.610870 | 2 | 1 | 0.011196 | 182 | 16255 | 0 | 0.0 | 0.000000 | 0 | ... | p1298 | p3760 | 0 | 0.586281 | 0.114077 | 0.083889 | 0.0 | 0.586281 | 0.586281 | 0.0 |
| 3 | 0.714809 | 7 | 1 | 0.079337 | 1479 | 18641 | 0 | 0.0 | 0.111111 | 1 | ... | p0211 | p1808 | 0 | 0.603561 | 0.093274 | 0.089614 | 0.0 | 0.603561 | 0.603561 | 0.0 |
| 4 | 0.771843 | 26 | 1 | 0.806250 | 645 | 799 | 0 | 0.0 | 0.000000 | 0 | ... | p0843 | p2964 | 0 | 0.721893 | 0.093263 | 0.094657 | 0.0 | 0.721893 | 0.721893 | 0.0 |
5 rows × 22 columns
| | paper_id | doi | title | publication_year | publication_date | cited_by_count | type | authors | concepts |
|---|---|---|---|---|---|---|---|---|---|
| 0 | p0000 | https://doi.org/10.1161/circulationaha.115.001593 | Machine Learning in Medicine | 2015 | 11/16/2015 | 2662 | review | Rahul C. Deo | Medicine; Medical physics; Medical education; ... |
| 1 | p0001 | https://doi.org/10.1504/ijmmno.2013.055204 | A literature survey of benchmark functions for... | 2013 | 1/1/2013 | 1138 | article | Momin Jamil; Xin‐She Yang | Benchmark (surveying); Set (abstract data type... |
| 2 | p0002 | https://doi.org/10.1109/icip.2017.8296547 | Abnormal event detection in videos using gener... | 2017 | 9/1/2017 | 486 | article | Mahdyar Ravanbakhsh; Moin Nabi; Enver Sanginet... | Abnormality; Computer science; Artificial inte... |
| 3 | p0003 | https://doi.org/10.3115/v1/p15-1001 | On Using Very Large Target Vocabulary for Neur... | 2015 | 1/1/2015 | 916 | article | Sébastien Jean; Kyunghyun Cho; Roland Memisevi... | Machine translation; Computer science; Vocabul... |
| 4 | p0004 | https://doi.org/10.1109/tpami.2007.1167 | Gaussian Process Dynamical Models for Human Mo... | 2007 | 12/20/2007 | 1016 | article | Jonathan M. Wang; David J. Fleet; Aaron Hertzmann | Gaussian process; Artificial intelligence; Lat... |
7.1 Re-attaching Categorical Author Context
Author fields are merged back for both the citing paper and the candidate reference. These variables are later available as categorical grouping keys, allowing the model to compare a pair not only in absolute terms but also relative to typical author-linked patterns in the data.
['paper', 'referenced_paper', 'authors', 'referenced_authors']
7.2 Memory-Aware Typing
Feature expansion can become wide and memory intensive. The next step downcasts numeric columns and converts object fields into categorical dtype. This optimization does not alter the scientific interpretation of the variables; it simply makes the subsequent aggregation and model-fitting stages feasible in a constrained notebook environment.
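A minimal version of such a downcasting pass (the function name is illustrative, and the notebook's helper may differ in detail):

```python
import pandas as pd

def reduce_mem_usage(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast numeric columns and convert object columns to category.

    Values are preserved; only the storage dtype changes."""
    for col in df.columns:
        if pd.api.types.is_integer_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="integer")
        elif pd.api.types.is_float_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="float")
        elif df[col].dtype == object:
            df[col] = df[col].astype("category")
    return df
```

Small-range integers collapse to int8/int16 and floats to float32, which is where the roughly 75% memory reduction reported above comes from.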
Memory usage of dataframe is 75.20 MB
Memory usage after optimization is: 18.76 MB
Decreased by 75.1%
Memory usage of dataframe is 61.53 MB
Memory usage after optimization is: 16.38 MB
Decreased by 73.4%
referenced_authors
Kaiming He; Xiangyu Zhang; Shaoqing Ren; Jian Sun 440
Guobao Wang; Jinyi Qi 439
Luciano Floridi 438
David Mackay 419
Charles F. Manski 339
...
Li Huang; Lei Wang 78
Yan-Kun Chen; Jingxuan Liu; Lingyun Peng; Yiqi Wu; Yige Xu 76
Jack Stilgoe 76
Wang Feng; Xiang Xiang; Jian Cheng; Alan Yuille 72
Samuel Gehman; Suchin Gururangan; Maarten Sap; Yejin Choi; Noah A. Smith 71
Name: count, Length: 3642, dtype: int64
7.3 Group-Conditioned Aggregate and Deviation Features
For each categorical key, numeric variables are summarized through group-level statistics such as mean, median, max, min, standard deviation, and variance. The notebook then computes difference-from-group baselines for selected statistics.
Methodologically, this acts as a contextual normalization layer. A feature value is often more informative when interpreted relative to the typical profile of similar papers, references, or authors than when it is viewed in isolation.
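The pattern can be sketched with a toy frame: a group-level mean via `groupby(...).transform`, followed by a difference-from-group column. The column values are made up; passing `observed=True` also avoids the pandas FutureWarning about categorical grouping keys.

```python
import pandas as pd

# Deviation of each pair's text_similarity from the mean over all pairs
# sharing the same referenced_authors key (illustrative values).
df = pd.DataFrame({
    "referenced_authors": ["A", "A", "B"],
    "text_similarity": [0.8, 0.6, 0.5],
})
grp_mean = df.groupby("referenced_authors", observed=True)["text_similarity"].transform("mean")
df["text_similarity_grp_mean"] = grp_mean
df["text_similarity_dev"] = df["text_similarity"] - grp_mean
print([round(v, 6) for v in df["text_similarity_dev"]])  # [0.1, -0.1, 0.0]
```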
/tmp/ipykernel_19/1729623842.py:33: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. .groupby(group_list)[numeric_cols]
(warning repeated four times, once per grouping key)
(410691, 633)
7.4 Model-Based Feature Selection
A preliminary XGBoost model is trained on numeric features to rank variables by predictive importance. Only the top 200 features are retained for the final stage. The purpose is both pragmatic and statistical: reduce noise, control dimensionality, and preserve the most discriminative cross-view signals before full training.
(410691, 201)
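The selection mechanics reduce to a top-k ranking over importance scores. In the real pipeline the scores come from the preliminary XGBoost fit's feature importances; the names and numbers below are made up for illustration.

```python
import numpy as np

def select_top_features(feature_names, importances, k=200):
    """Keep the k highest-importance features, in descending order."""
    order = np.argsort(importances)[::-1][:k]
    return [feature_names[i] for i in order]

names = ["text_similarity", "year_diff", "max_chunk_sim_25_10", "same_type"]
imps = np.array([0.40, 0.05, 0.35, 0.20])  # illustrative importance scores
print(select_top_features(names, imps, k=2))
# ['text_similarity', 'max_chunk_sim_25_10']
```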
7.5 Final Training with Imbalance Handling
The target is binary and typically imbalanced, so the classifier uses scale_pos_weight to rebalance the influence of positive citation edges. XGBoost is a suitable final learner here because citation decisions are driven by nonlinear interactions, for example high local overlap that only matters when chronological feasibility is satisfied.
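The conventional setting for this parameter is the negative-to-positive count ratio, so that the minority positive class contributes comparable total gradient weight. The counts below are illustrative, not the notebook's actual class balance.

```python
# Rebalancing weight passed to the classifier, e.g.
# XGBClassifier(scale_pos_weight=...). Counts here are made up.
n_pos, n_neg = 30_000, 380_000
scale_pos_weight = n_neg / n_pos
print(round(scale_pos_weight, 2))  # 12.67
```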
7.6 Validation with Matthews Correlation Coefficient
Performance is estimated with stratified 5-fold cross-validation and scored using MCC. This validation choice is important because MCC remains informative under class imbalance and directly measures whether the classifier is learning a balanced citation decision boundary rather than exploiting majority-class prevalence.
CV MCC scores: [0.6080689 0.62554884 0.61562595 0.60015708 0.59914916]
Mean MCC : 0.609709985759617
7.7 Submission Export
After cross-validated sanity checking, the final model is refit on the full training set and used to generate the competition submission file. At this point the notebook has transformed raw scientific text and metadata into an end-to-end citation prediction pipeline.
Wrote submission.csv with shape (336021, 2)