Assignment: GraphRAG Implementation

Assignment Metadata

| Field | Description |
| --- | --- |
| Assignment Name | GraphRAG with Neo4j and Entity Extraction |
| Course | RAG and Optimization |
| Project Name | graph-rag-system |
| Estimated Time | 150 minutes |
| Framework | Python 3.10+, LangChain, Neo4j, OpenAI API, Pydantic |


Learning Objectives

By completing this assignment, you will be able to:

  • Design a GraphRAG architecture combining graph and vector databases

  • Implement entity and relationship extraction from documents using LLMs

  • Build and populate a knowledge graph in Neo4j

  • Create Cypher queries for graph traversal and retrieval

  • Integrate graph-based retrieval with LLM answer generation


Problem Description

Your organization has policy documents, contracts, and technical specifications that contain rich relationships between entities (stakeholders, regulations, commitments, etc.). Traditional vector search struggles to answer queries like:

  • “Which policies affect both Employees and Partners?”

  • “What commitments have measurable constraints?”

  • “Show all regulations referenced by the Leave Policy”

Your task is to implement a GraphRAG system that extracts entities and relationships, stores them in Neo4j, and enables relationship-aware retrieval.


Technical Requirements

Environment Setup

  • Python 3.10 or higher

  • Neo4j Database (Desktop or Docker)

  • Required packages:

    • langchain >= 0.1.0

    • langchain-neo4j >= 0.1.0

    • openai >= 1.0.0

    • pydantic >= 2.0.0

    • docling or pypdf for document processing

Neo4j Setup

# Docker setup
docker run -d --name neo4j \
  -p 7474:7474 -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/password \
  neo4j:latest

Tasks

Task 1: Define Domain Schema (20 points)

  1. Design Pydantic models for your domain entities:

    • Identify at least 4 entity types from your documents

    • Define relationships between entities

    • Include constraints and measurable properties

  2. Example schema structure:

    from pydantic import BaseModel, Field

    class Entity(BaseModel):
        name: str        # canonical entity name
        type: str        # one of your entity types, e.g. "Policy", "Stakeholder"
        properties: dict[str, str] = Field(default_factory=dict)

    class Relationship(BaseModel):
        source: str         # name of the source entity
        target: str         # name of the target entity
        relation_type: str  # e.g. "AFFECTS", "REFERENCES"

  3. Document your schema with:

    • Entity type descriptions

    • Relationship type definitions

    • Example instances from your domain
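Because Task 2 relies on structured LLM output, it helps to bundle the entity and relationship models into one chunk-level container. The sketch below is self-contained and illustrative only: the `KnowledgeGraph` name, the field defaults, and the HR example values are our own choices, not part of the assignment.

```python
from pydantic import BaseModel, Field

class Entity(BaseModel):
    name: str
    type: str
    properties: dict[str, str] = Field(default_factory=dict)

class Relationship(BaseModel):
    source: str
    target: str
    relation_type: str

# One container per chunk: an LLM with structured output can return
# all entities and relationships from that chunk in a single call.
class KnowledgeGraph(BaseModel):
    entities: list[Entity] = Field(default_factory=list)
    relationships: list[Relationship] = Field(default_factory=list)

# Example instance from an HR-policy domain (illustrative values):
example = KnowledgeGraph(
    entities=[
        Entity(name="Leave Policy", type="Policy", properties={"max_days": "20"}),
        Entity(name="Employees", type="Stakeholder"),
    ],
    relationships=[
        Relationship(source="Leave Policy", target="Employees", relation_type="AFFECTS"),
    ],
)
```

With a chat model, a container like this is typically bound via structured output, e.g. `llm.with_structured_output(KnowledgeGraph)` in LangChain (check the current langchain-openai documentation for the exact call).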

Task 2: Entity and Relationship Extraction (30 points)

  1. Implement an extraction pipeline:

    • Load and chunk documents

    • Use LLM with structured output to extract entities

    • Extract relationships between entities

    • Handle extraction errors and edge cases

  2. Design extraction prompts that:

    • Provide clear instructions for entity identification

    • Include examples (few-shot learning)

    • Specify output format matching your Pydantic models

  3. Quality checks:

    • Validate extracted entities against schema

    • Handle duplicate entities across chunks

    • Log extraction statistics (entities/chunk, relationship types)
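The cross-chunk duplicate handling in the quality checks above can be sketched as a pure merge step. Everything here is an assumption for illustration: the dict shape (`name`/`type`/`properties` keys), the case-insensitive merge key, and the keep-first policy for conflicting property values.

```python
# Merge duplicate entities extracted from different chunks.
# Entities are keyed case-insensitively on (name, type); properties seen
# in later chunks fill gaps but never overwrite earlier values.
def dedupe_entities(entities: list[dict]) -> list[dict]:
    merged: dict[tuple[str, str], dict] = {}
    for ent in entities:
        key = (ent["name"].strip().lower(), ent["type"].strip().lower())
        if key not in merged:
            merged[key] = {
                "name": ent["name"].strip(),
                "type": ent["type"].strip(),
                "properties": dict(ent.get("properties", {})),
            }
        else:
            # Keep the first value for a property; only add missing ones.
            for prop, value in ent.get("properties", {}).items():
                merged[key]["properties"].setdefault(prop, value)
    return list(merged.values())
```

Running the helper over per-chunk extraction results before graph population also makes the extraction statistics (entities per chunk, duplicates merged) easy to log.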

Task 3: Build Knowledge Graph (25 points)

  1. Populate Neo4j with extracted data:

    • Create nodes for each entity type

    • Create relationships between entities

    • Use MERGE to prevent duplicates

    • Add properties to nodes and relationships

  2. Implement graph queries:

    • Count entities by type

    • Find entities with specific relationships

    • Traverse multi-hop relationships

    • Aggregate information across connected nodes

  3. Example queries to implement:

    // Find all entities related to a specific policy
    MATCH (p:Policy {name: $policy_name})-[r]->(e)
    RETURN p, r, e
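Step 1 above (MERGE-based population) can be sketched as small query builders. Node labels and relationship types cannot be passed as Cypher parameters, so they are sanitized and interpolated into the query text, while all values go through `$`-parameters. The dict shapes and function names are illustrative assumptions.

```python
# Build parameterized MERGE statements from extracted entities/relationships.
def entity_to_merge(entity: dict) -> tuple[str, dict]:
    # Labels can't be bound as parameters: sanitize to alphanumerics/underscore.
    label = "".join(c for c in entity["type"] if c.isalnum() or c == "_")
    query = f"MERGE (n:{label} {{name: $name}}) SET n += $props"
    return query, {"name": entity["name"], "props": entity.get("properties", {})}

def relationship_to_merge(rel: dict) -> tuple[str, dict]:
    rel_type = "".join(c for c in rel["relation_type"].upper() if c.isalnum() or c == "_")
    query = (
        "MATCH (a {name: $source}), (b {name: $target}) "
        f"MERGE (a)-[:{rel_type}]->(b)"
    )
    return query, {"source": rel["source"], "target": rel["target"]}

# Against a live database, the tuples would be executed roughly like:
# from neo4j import GraphDatabase
# driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
# with driver.session() as session:
#     for ent in entities:
#         session.run(*entity_to_merge(ent))
```

Because both builders use MERGE on the entity name, re-running the population step is idempotent, which matches the "use MERGE to prevent duplicates" requirement.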
    

Task 4: GraphRAG Query Pipeline (25 points)

  1. Implement natural language to Cypher translation:

    • Use GraphCypherQAChain or custom implementation

    • Handle query validation and error recovery

    • Support common question patterns

  2. Create a test set with 10 queries:

    • Entity lookup queries (5)

    • Relationship traversal queries (3)

    • Aggregation queries (2)

  3. Demonstrate answers that:

    • Leverage graph relationships

    • Would be difficult or impossible with vector search alone

    • Combine information from multiple connected entities
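For the query-validation part of step 1, one simple guard is to reject generated Cypher that would write to the graph before executing it. This is a minimal sketch: the keyword list and helper name are our own choices, and a production validator would need more than keyword matching.

```python
import re

# Clauses that mutate the graph; generated retrieval queries should use none.
WRITE_CLAUSES = ("CREATE", "MERGE", "DELETE", "DETACH", "SET", "REMOVE", "DROP")

def is_read_only(cypher: str) -> bool:
    # Strip string literals first, so entity names like 'Reset Policy'
    # don't falsely trigger the keyword check.
    stripped = re.sub(r"'[^']*'|\"[^\"]*\"", "", cypher)
    tokens = re.findall(r"[A-Za-z_]+", stripped.upper())
    return not any(tok in WRITE_CLAUSES for tok in tokens)

# With a live graph and an API key, the full pipeline could be wired up
# roughly as below (GraphCypherQAChain is the LangChain helper named in
# the task; check the current langchain-neo4j docs for exact imports
# and safety flags):
# from langchain_neo4j import Neo4jGraph, GraphCypherQAChain
# graph = Neo4jGraph(url="bolt://localhost:7687", username="neo4j", password="password")
# chain = GraphCypherQAChain.from_llm(llm=..., graph=graph, validate_cypher=True)
```

A guard like this pairs naturally with error recovery: if a generated query fails validation or raises a syntax error, feed the error message back to the LLM and ask for a corrected query.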


Submission Requirements

Required Deliverables

  • Source code (Jupyter notebook or Python scripts)

  • README.md with setup and usage instructions

  • Schema documentation (entity types, relationships)

  • Sample Cypher queries and results

  • Screenshots of Neo4j graph visualization

Submission Checklist

  • Pydantic models correctly validate extracted data

  • Extraction pipeline processes documents without errors

  • Neo4j graph is populated with entities and relationships

  • Natural language queries return correct results

  • Documentation explains the graph schema design decisions


Evaluation Criteria

| Criteria | Points |
| --- | --- |
| Schema design quality | 15 |
| Extraction pipeline correctness | 20 |
| Prompt engineering effectiveness | 10 |
| Graph population implementation | 15 |
| Cypher query implementation | 15 |
| Query pipeline integration | 15 |
| Code quality and documentation | 10 |
| Total | 100 |


Hints

  • Start with a small document set (2-3 pages) to iterate on your schema

  • Use model.with_structured_output() for reliable JSON extraction from LLMs

  • Test Cypher queries in Neo4j Browser before implementing in code

  • Refer to the companion notebooks 05-graph_rag_v1.ipynb and 05-graph_rag_v2.ipynb

  • The sample FSoft_HR.pdf provides a good starting point for HR policy extraction