Assignment: GraphRAG Implementation#
Assignment Metadata#
| Field | Description |
|---|---|
| Assignment Name | GraphRAG with Neo4j and Entity Extraction |
| Course | RAG and Optimization |
| Project Name | |
| Estimated Time | 150 minutes |
| Framework | Python 3.10+, LangChain, Neo4j, OpenAI API, Pydantic |
Learning Objectives#
By completing this assignment, you will be able to:
Design a GraphRAG architecture combining graph and vector databases
Implement entity and relationship extraction from documents using LLMs
Build and populate a knowledge graph in Neo4j
Create Cypher queries for graph traversal and retrieval
Integrate graph-based retrieval with LLM answer generation
Problem Description#
Your organization has policy documents, contracts, and technical specifications that contain rich relationships between entities (stakeholders, regulations, commitments, etc.). Traditional vector search struggles to answer queries like:
“Which policies affect both Employees and Partners?”
“What commitments have measurable constraints?”
“Show all regulations referenced by the Leave Policy”
Your task is to implement a GraphRAG system that extracts entities and relationships, stores them in Neo4j, and enables relationship-aware retrieval.
Technical Requirements#
Environment Setup#
Python 3.10 or higher
Neo4j Database (Desktop or Docker)
Required packages:
langchain >= 0.1.0
langchain-neo4j >= 0.1.0
openai >= 1.0.0
pydantic >= 2.0.0
docling or pypdf for document processing
Neo4j Setup#
# Docker setup
docker run -d --name neo4j \
-p 7474:7474 -p 7687:7687 \
-e NEO4J_AUTH=neo4j/password \
neo4j:latest
Tasks#
Task 1: Define Domain Schema (20 points)#
Design Pydantic models for your domain entities:
Identify at least 4 entity types from your documents
Define relationships between entities
Include constraints and measurable properties
Example schema structure:
from pydantic import BaseModel

class Entity(BaseModel):
    name: str
    type: str
    properties: dict

class Relationship(BaseModel):
    source: str
    target: str
    relation_type: str
Document your schema with:
Entity type descriptions
Relationship type definitions
Example instances from your domain
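One way to tighten the example schema is to restrict entity and relationship types to a fixed vocabulary, so that invalid extractions fail validation early. The sketch below assumes Pydantic 2.x; the type names (`Policy`, `Stakeholder`, `Regulation`, `Commitment`, `AFFECTS`, and so on) and the `ExtractionResult` wrapper are hypothetical placeholders for whatever your documents actually contain.

```python
from typing import Any, Literal

from pydantic import BaseModel, Field, ValidationError

# Hypothetical vocabularies for an HR-policy domain; replace with your own.
EntityType = Literal["Policy", "Stakeholder", "Regulation", "Commitment"]
RelationType = Literal["AFFECTS", "REFERENCES", "COMMITS_TO"]


class Entity(BaseModel):
    name: str = Field(min_length=1)
    type: EntityType
    # Measurable constraints (limits, deadlines, ...) can live here.
    properties: dict[str, Any] = Field(default_factory=dict)


class Relationship(BaseModel):
    source: str = Field(min_length=1)
    target: str = Field(min_length=1)
    relation_type: RelationType


class ExtractionResult(BaseModel):
    """Hypothetical container for everything extracted from one chunk."""
    entities: list[Entity] = Field(default_factory=list)
    relationships: list[Relationship] = Field(default_factory=list)


leave = Entity(name="Leave Policy", type="Policy", properties={"max_days": 12})
link = Relationship(source="Leave Policy", target="Employees", relation_type="AFFECTS")

# An entity type outside the vocabulary is rejected by validation.
try:
    Entity(name="Leave Policy", type="Unknown")
    rejected = False
except ValidationError:
    rejected = True
```

Using `Literal` types keeps the allowed labels in one place, which also helps later when the same names are used as Neo4j labels and relationship types.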
Task 2: Entity and Relationship Extraction (30 points)#
Implement an extraction pipeline:
Load and chunk documents
Use LLM with structured output to extract entities
Extract relationships between entities
Handle extraction errors and edge cases
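The pipeline steps above can be sketched without committing to a specific LLM client. In this minimal example, `chunk_text`, `extract_all`, and the `extract_fn` callable are hypothetical names. In practice `extract_fn` might wrap a call such as `model.with_structured_output(...)` from the Hints, but here it is just any function that maps a chunk to an extraction result. Chunking by character count is an assumption; a token- or section-based splitter may suit your documents better.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("extraction")


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        piece = text[start:start + chunk_size]
        if piece.strip():  # skip whitespace-only windows
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


def extract_all(
    chunks: list[str], extract_fn: Callable[[str], Any]
) -> tuple[list[Any], list[int]]:
    """Run extract_fn on every chunk; log and record failures instead of aborting."""
    results: list[Any] = []
    failed: list[int] = []
    for index, chunk in enumerate(chunks):
        try:
            results.append(extract_fn(chunk))
        except Exception as exc:  # e.g. API errors or schema validation errors
            logger.warning("chunk %d failed: %s", index, exc)
            failed.append(index)
    return results, failed
```

Returning the indices of failed chunks makes it possible to retry only those chunks, or to report them in the extraction statistics.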
Design extraction prompts that:
Provide clear instructions for entity identification
Include examples (few-shot learning)
Specify output format matching your Pydantic models
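A prompt that follows these three points might be assembled as in the sketch below. The wording of the instructions, the single few-shot example, and the function name `build_extraction_prompt` are all illustrative assumptions, not a prescribed template; the output keys mirror the `Entity` and `Relationship` fields from Task 1.

```python
import json

# One hypothetical few-shot example; real prompts usually include several.
FEW_SHOT = [
    {
        "text": "The Leave Policy applies to all Employees.",
        "output": {
            "entities": [
                {"name": "Leave Policy", "type": "Policy", "properties": {}},
                {"name": "Employees", "type": "Stakeholder", "properties": {}},
            ],
            "relationships": [
                {"source": "Leave Policy", "target": "Employees", "relation_type": "AFFECTS"}
            ],
        },
    }
]


def build_extraction_prompt(
    chunk: str, entity_types: list[str], relation_types: list[str]
) -> str:
    """Assemble instructions, few-shot examples, and the chunk into one prompt."""
    lines = [
        "Extract entities and relationships from the text below.",
        f"Allowed entity types: {', '.join(entity_types)}.",
        f"Allowed relation types: {', '.join(relation_types)}.",
        "Use only names that appear in the text.",
        "Return JSON with the keys 'entities' and 'relationships'.",
        "",
    ]
    for example in FEW_SHOT:
        lines.append(f"Text: {example['text']}")
        lines.append(f"Output: {json.dumps(example['output'])}")
        lines.append("")
    lines.append(f"Text: {chunk}")
    lines.append("Output:")
    return "\n".join(lines)
```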
Quality checks:
Validate extracted entities against schema
Handle duplicate entities across chunks
Log extraction statistics (entities/chunk, relationship types)
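Duplicate handling and statistics can be sketched on plain dictionaries, such as those produced by calling `model_dump()` on Pydantic models. The matching rule used here (same type, case-insensitive name with collapsed whitespace) is an assumption; real data may need aliases or fuzzy matching. The names `normalize`, `merge_entities`, and `extraction_stats` are hypothetical.

```python
from collections import Counter


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


def merge_entities(entities: list[dict]) -> list[dict]:
    """Merge entities that share a type and a normalized name."""
    merged: dict[tuple[str, str], dict] = {}
    for entity in entities:
        key = (entity["type"], normalize(entity["name"]))
        if key not in merged:
            merged[key] = {
                "name": entity["name"],  # keep the first spelling seen
                "type": entity["type"],
                "properties": dict(entity.get("properties", {})),
            }
        else:
            # Later chunks may add properties; later values win on conflict.
            merged[key]["properties"].update(entity.get("properties", {}))
    return list(merged.values())


def extraction_stats(per_chunk: list[dict]) -> dict:
    """Summarize entities per chunk and relationship type frequencies."""
    entity_counts = [len(result["entities"]) for result in per_chunk]
    relation_types = Counter(
        rel["relation_type"] for result in per_chunk for rel in result["relationships"]
    )
    return {
        "chunks": len(per_chunk),
        "entities_per_chunk": sum(entity_counts) / len(per_chunk) if per_chunk else 0.0,
        "relation_types": dict(relation_types),
    }
```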
Task 3: Build Knowledge Graph (25 points)#
Populate Neo4j with extracted data:
Create nodes for each entity type
Create relationships between entities
Use `MERGE` to prevent duplicates
Add properties to nodes and relationships
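The population steps above can be sketched as small helpers that build parameterized `MERGE` statements. Values are passed as query parameters, while the node label and relationship type are interpolated into the query text only after being checked against an allowlist. The allowlists and the helper names `merge_node_query` and `merge_relationship_query` are hypothetical, and the sketch assumes entity names are unique across types.

```python
# Hypothetical allowlists; they should mirror the schema from Task 1.
ALLOWED_LABELS = {"Policy", "Stakeholder", "Regulation", "Commitment"}
ALLOWED_RELATIONS = {"AFFECTS", "REFERENCES", "COMMITS_TO"}


def merge_node_query(entity: dict) -> tuple[str, dict]:
    """Build an idempotent MERGE for one entity node."""
    label = entity["type"]
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unknown label: {label}")
    # MERGE matches on name only; SET then adds or overwrites properties.
    # Neo4j property values must be primitives or lists, not nested dicts.
    query = f"MERGE (n:{label} {{name: $name}}) SET n += $properties"
    params = {"name": entity["name"], "properties": entity.get("properties", {})}
    return query, params


def merge_relationship_query(rel: dict) -> tuple[str, dict]:
    """Build an idempotent MERGE for one relationship between existing nodes."""
    rel_type = rel["relation_type"]
    if rel_type not in ALLOWED_RELATIONS:
        raise ValueError(f"unknown relationship type: {rel_type}")
    # Matching by name without a label assumes names are unique across types.
    query = (
        "MATCH (a {name: $source}), (b {name: $target}) "
        f"MERGE (a)-[r:{rel_type}]->(b)"
    )
    return query, {"source": rel["source"], "target": rel["target"]}


query, params = merge_node_query(
    {"name": "Leave Policy", "type": "Policy", "properties": {"max_days": 12}}
)
```

Each `(query, params)` pair can then be sent to the database, for example with the official Neo4j Python driver's `driver.execute_query(query, parameters_=params)`, assuming a 5.x driver is installed. Running the same statements twice should leave the graph unchanged, which is the point of `MERGE`.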
Implement graph queries:
Count entities by type
Find entities with specific relationships
Traverse multi-hop relationships
Aggregate information across connected nodes
Example queries to implement:
// Find all entities related to a specific policy
MATCH (p:Policy {name: $policy_name})-[r]->(e)
RETURN p, r, e
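The four query categories listed for this task could look like the Cypher strings below, kept in a Python dictionary so they can be passed to a driver with parameters. The labels, relationship types, and parameter names (`Policy`, `AFFECTS`, `$stakeholder`, and so on) are assumptions carried over from a hypothetical HR-policy schema; adjust them to your own graph.

```python
# Hypothetical query catalogue; keys are illustrative names only.
QUERIES = {
    # Count entities by type, using the first label of each node.
    "count_by_type": (
        "MATCH (n) "
        "RETURN labels(n)[0] AS type, count(*) AS total "
        "ORDER BY total DESC"
    ),
    # Find entities with a specific relationship.
    "policies_affecting": (
        "MATCH (p:Policy)-[:AFFECTS]->(s:Stakeholder {name: $stakeholder}) "
        "RETURN p.name AS policy"
    ),
    # Traverse one to three hops outward from a policy.
    "multi_hop": (
        "MATCH path = (p:Policy {name: $policy_name})-[*1..3]->(e) "
        "RETURN DISTINCT e.name AS name, length(path) AS hops"
    ),
    # Aggregate connected nodes per policy.
    "regulations_per_policy": (
        "MATCH (p:Policy)-[:REFERENCES]->(r:Regulation) "
        "RETURN p.name AS policy, collect(r.name) AS regulations, count(r) AS total"
    ),
}
```

Bounding the variable-length pattern (`*1..3`) keeps multi-hop traversals from exploring the whole graph.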
Task 4: GraphRAG Query Pipeline (25 points)#
Implement natural language to Cypher translation:
Use `GraphCypherQAChain` or custom implementation
Handle query validation and error recovery
Support common question patterns
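For a custom implementation, validation and error recovery might follow the pattern sketched below: reject generated Cypher that contains write clauses, and feed any error back to the generator for another attempt. The `generate` and `run` callables are hypothetical stand-ins for an LLM call and a Neo4j query, and the keyword scan is a deliberately conservative heuristic (it would also reject a harmless query that mentions `SET` inside a string literal), not a full Cypher parser.

```python
import re
from typing import Any, Callable

# Clauses that modify the graph or call procedures; conservative by design.
WRITE_CLAUSES = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP|CALL|LOAD\s+CSV)\b",
    re.IGNORECASE,
)


def is_read_only(cypher: str) -> bool:
    """Return True if no write or procedure clause appears in the query text."""
    return WRITE_CLAUSES.search(cypher) is None


def run_with_retry(
    question: str,
    generate: Callable[[str, str], str],
    run: Callable[[str], Any],
    max_attempts: int = 3,
) -> Any:
    """Generate Cypher, validate it, run it, and retry with feedback on failure."""
    feedback = ""
    for _ in range(max_attempts):
        cypher = generate(question, feedback)
        if not is_read_only(cypher):
            feedback = "Only read-only queries are allowed."
            continue
        try:
            return run(cypher)
        except Exception as exc:  # e.g. a Cypher syntax error from the database
            feedback = f"Previous query failed: {exc}"
    raise RuntimeError(f"no valid query after {max_attempts} attempts: {feedback}")
```

Passing the feedback string back into `generate` lets the prompt include the previous error, which often helps the model correct a label or property name on the next attempt.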
Create a test set with 10 queries:
Entity lookup queries (5)
Relationship traversal queries (3)
Aggregation queries (2)
Demonstrate answers that:
Leverage graph relationships
Would be difficult/impossible with vector search alone
Combine information from multiple connected entities
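To combine information from several connected entities in one answer, the rows returned by a Cypher query can be flattened into a text context for the answering LLM. The sketch below is one possible shape; `format_context`, `build_answer_prompt`, and the prompt wording are hypothetical, and the rows are assumed to be plain dictionaries keyed by the query's `RETURN` aliases.

```python
def format_context(rows: list[dict], limit: int = 20) -> str:
    """Render query result rows as one 'key=value; key=value' line per row."""
    lines = []
    for row in rows[:limit]:  # cap the context so the prompt stays small
        lines.append("; ".join(f"{key}={value}" for key, value in row.items()))
    return "\n".join(lines)


def build_answer_prompt(question: str, rows: list[dict]) -> str:
    """Build a prompt that grounds the answer in the graph query results."""
    context = format_context(rows) or "(no rows returned)"
    return (
        "Answer the question using only the graph query results below.\n"
        f"Results:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )
```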
Submission Requirements#
Required Deliverables#
Source code (Jupyter notebook or Python scripts)
`README.md` with setup and usage instructions
Schema documentation (entity types, relationships)
Sample Cypher queries and results
Screenshots of Neo4j graph visualization
Submission Checklist#
Pydantic models correctly validate extracted data
Extraction pipeline processes documents without errors
Neo4j graph is populated with entities and relationships
Natural language queries return correct results
Documentation explains the graph schema design decisions
Evaluation Criteria#
| Criteria | Points |
|---|---|
| Schema design quality | 15 |
| Extraction pipeline correctness | 20 |
| Prompt engineering effectiveness | 10 |
| Graph population implementation | 15 |
| Cypher query implementation | 15 |
| Query pipeline integration | 15 |
| Code quality and documentation | 10 |
| Total | 100 |
Hints#
Start with a small document set (2-3 pages) to iterate on your schema
Use `model.with_structured_output()` for reliable JSON extraction from LLMs
Test Cypher queries in Neo4j Browser before implementing in code
Consider the companion notebooks `05-graph_rag_v1.ipynb` and `05-graph_rag_v2.ipynb`
The sample `FSoft_HR.pdf` provides a good starting point for HR policy extraction