From Text to Knowledge
Simon Frost
Introduction
SemanticSpacetime.jl can automatically extract knowledge structure from plain text. The text_to_n4l function analyses text for intentionality — identifying the most significant sentences — and formats them as N4L. This vignette demonstrates the full text→N4L→graph→search pipeline.
Setup
using SemanticSpacetime
SemanticSpacetime.reset_arrows!()
SemanticSpacetime.reset_contexts!()
const SST_CONFIG = joinpath(@__DIR__, "..", "..", "..", "SSTorytime", "SSTconfig")
println("Config available: ", isdir(SST_CONFIG))Config available: trueStep 1: Text to N4L Conversion
Let’s start with a factual paragraph about climate science:
factual_text = """
The global mean surface temperature has increased by approximately 1.1 degrees
Celsius since the pre-industrial era. This warming is primarily driven by
anthropogenic greenhouse gas emissions, particularly carbon dioxide from fossil
fuel combustion. The ocean has absorbed more than 90 percent of the excess heat,
leading to thermal expansion and contributing to sea level rise. Arctic sea ice
extent has declined significantly, with September minimums reaching record lows.
Climate models project continued warming under all emission scenarios, with
temperature increases ranging from 1.5 to 4.5 degrees by 2100. Extreme weather
events including heatwaves, droughts, and intense precipitation are becoming
more frequent and severe. Mitigation strategies focus on reducing emissions
through renewable energy adoption, energy efficiency improvements, and carbon
capture technologies. Adaptation measures include building resilient
infrastructure, developing drought-resistant crops, and implementing early
warning systems for extreme weather events.
"""
n4l_output = text_to_n4l(factual_text; chapter="climate science", target_percent=60.0)
println("--- Generated N4L ---")
println(n4l_output)--- Generated N4L ---
- Samples from climate science
# (begin) ************
:: _sequence_ , climate science::
@sen1 The global mean surface temperature has increased by approximately 1.1 degrees
" (extract-fr) part 0 of climate science
@sen2 This warming is primarily driven by
" (extract-fr) part 0 of climate science
@sen3 anthropogenic greenhouse gas emissions, particularly carbon dioxide from fossil
" (extract-fr) part 0 of climate science
@sen4 The ocean has absorbed more than 90 percent of the excess heat,
" (extract-fr) part 0 of climate science
@sen5 leading to thermal expansion and contributing to sea level rise.
" (extract-fr) part 0 of climate science
@sen6 extent has declined significantly, with September minimums reaching record lows.
" (extract-fr) part 0 of climate science
@sen7 Climate models project continued warming under all emission scenarios, with
" (extract-fr) part 0 of climate science
@sen8 temperature increases ranging from 1.5 to 4.5 degrees by 2100.
" (extract-fr) part 0 of climate science
@sen9 Extreme weather
" (extract-fr) part 0 of climate science
@sen10 events including heatwaves, droughts, and intense precipitation are becoming
" (extract-fr) part 0 of climate science
@sen11 more frequent and severe.
" (extract-fr) part 0 of climate science
@sen12 Mitigation strategies focus on reducing emissions
" (extract-fr) part 0 of climate science
@sen13 through renewable energy adoption, energy efficiency improvements, and carbon
" (extract-fr) part 0 of climate science
@sen14 Adaptation measures include building resilient
" (extract-fr) part 0 of climate science
@sen15 infrastructure, developing drought-resistant crops, and implementing early
" (extract-fr) part 0 of climate science
@sen16 warning systems for extreme weather events.
" (extract-fr) part 0 of climate science
-:: _sequence_ , climate science::
# (end) ************
# Final fraction 80.0 of requested 60.0
# Selected 16 samples of 20Step 2: Intentionality Scoring
The score_sentence function measures how “intentional” or meaningful a sentence is based on word frequency patterns. Let’s compare different text styles:
# Build vocabularies for different texts
emotional_text = """
I absolutely love the feeling of warm sunshine on my face! It fills me with
incredible joy and happiness. Nothing beats the wonderful sensation of a
beautiful summer morning. I feel so grateful and blessed to experience these
magical moments. My heart overflows with pure delight and contentment.
"""
technical_text = """
The TCP three-way handshake establishes a reliable connection between client
and server. The client sends a SYN packet with an initial sequence number.
The server responds with SYN-ACK, acknowledging the client's sequence number
and providing its own. The client completes the handshake by sending an ACK
packet. This mechanism ensures both endpoints are ready for bidirectional
data transfer with guaranteed delivery and ordering.
"""
for (label, text) in [("Factual/Climate", factual_text),
("Emotional", emotional_text),
("Technical/TCP", technical_text)]
sentences = SemanticSpacetime.split_sentences(SemanticSpacetime.clean_text(text))
vocab = SemanticSpacetime.build_vocab(sentences)
scores = [score_sentence(s, vocab) for s in sentences]
avg = isempty(scores) ? 0.0 : sum(scores) / length(scores)
println("$label: $(length(sentences)) sentences, avg score $(round(avg, digits=2))")
endFactual/Climate: 20 sentences, avg score 11.08
Emotional: 8 sentences, avg score 4.64
Technical/TCP: 9 sentences, avg score 16.51Score individual sentences to see which carry the most information:
sentences = SemanticSpacetime.split_sentences(SemanticSpacetime.clean_text(factual_text))
vocab = SemanticSpacetime.build_vocab(sentences)
println("Sentence intentionality scores:")
for (i, sent) in enumerate(sentences)
score = score_sentence(sent, vocab)
short = sent[1:min(60, length(sent))]
println(" [$i] $(round(score, digits=2)): $short...")
endSentence intentionality scores:
[1] 31.11: The global mean surface temperature has increased by approxi...
[2] 5.83: Celsius since the pre-industrial era....
[3] 9.91: This warming is primarily driven by...
[4] 9.94: anthropogenic greenhouse gas emissions, particularly carbon ...
[5] 0.0: fuel combustion....
[6] 20.07: The ocean has absorbed more than 90 percent of the excess he...
[7] 16.09: leading to thermal expansion and contributing to sea level r...
[8] 2.98: Arctic sea ice...
[9] 8.4: extent has declined significantly, with September minimums r...
[10] 10.94: Climate models project continued warming under all emission ...
[11] 27.78: temperature increases ranging from 1.5 to 4.5 degrees by 210...
[12] 13.92: Extreme weather...
[13] 7.21: events including heatwaves, droughts, and intense precipitat...
[14] 11.19: more frequent and severe....
[15] 0.0: Mitigation strategies focus on reducing emissions...
[16] 25.11: through renewable energy adoption, energy efficiency improve...
[17] 0.0: capture technologies....
[18] 0.0: Adaptation measures include building resilient...
[19] 7.21: infrastructure, developing drought-resistant crops, and impl...
[20] 13.92: warning systems for extreme weather events....Step 3: Extract Significant Sentences
The extract_significant_sentences function combines running and static intentionality analysis to select the most meaningful sentences:
selected = extract_significant_sentences(factual_text; target_percent=40.0)
println("Selected $(length(selected)) of $(length(sentences)) sentences:\n")
for (i, sent) in enumerate(selected)
println(" $i. $sent")
endSelected 13 of 20 sentences:
1. The global mean surface temperature has increased by approximately 1.1 degrees
2. anthropogenic greenhouse gas emissions, particularly carbon dioxide from fossil
3. The ocean has absorbed more than 90 percent of the excess heat,
4. leading to thermal expansion and contributing to sea level rise.
5. extent has declined significantly, with September minimums reaching record lows.
6. Climate models project continued warming under all emission scenarios, with
7. temperature increases ranging from 1.5 to 4.5 degrees by 2100.
8. Extreme weather
9. events including heatwaves, droughts, and intense precipitation are becoming
10. more frequent and severe.
11. through renewable energy adoption, energy efficiency improvements, and carbon
12. infrastructure, developing drought-resistant crops, and implementing early
13. warning systems for extreme weather events.Step 4: Compile Auto-Generated N4L
Now compile the generated N4L into a graph and explore it:
store = MemoryStore()
try
cr = compile_n4l_string!(store, n4l_output; config_dir=SST_CONFIG)
println("Compiled: $(cr.nodes_created) nodes, $(cr.edges_created) edges")
println("Chapters: ", join(cr.chapters, ", "))
catch e
println("Compilation note: ", e)
end
println("Graph: $(node_count(store)) nodes, $(link_count(store)) links")Compiled: 17 nodes, 31 edges
Chapters: Samples from climate science
Graph: 17 nodes, 62 linksStep 5: Search the Knowledge Graph
Search the auto-generated graph for key concepts:
for query in ["temperature", "carbon", "ocean", "emission", "climate"]
results = mem_search_text(store, query)
println("Search '$query': $(length(results)) results")
for n in results[1:min(2, length(results))]
short = n.s[1:min(50, length(n.s))]
println(" '$(short)...'")
end
endSearch 'temperature': 2 results
'temperature increases ranging from 1.5 to 4.5 degr...'
'The global mean surface temperature has increased ...'
Search 'carbon': 2 results
'anthropogenic greenhouse gas emissions, particular...'
'through renewable energy adoption, energy efficien...'
Search 'ocean': 1 results
'The ocean has absorbed more than 90 percent of the...'
Search 'emission': 3 results
'anthropogenic greenhouse gas emissions, particular...'
'Mitigation strategies focus on reducing emissions...'
Search 'climate': 2 results
'part 0 of climate science...'
'Climate models project continued warming under all...'Step 6: Compare Text Styles
Generate N4L from different text styles and compare the results:
for (label, text) in [("Emotional", emotional_text), ("Technical", technical_text)]
n4l = text_to_n4l(text; chapter=lowercase(label), target_percent=50.0)
s = MemoryStore()
try
cr = compile_n4l_string!(s, n4l; config_dir=SST_CONFIG)
println("$label: $(cr.nodes_created) nodes, $(cr.edges_created) edges")
catch e
println("$label: compilation note - $e")
end
println(" Graph: $(node_count(s)) nodes, $(link_count(s)) links")
endEmotional: 5 nodes, 7 edges
Graph: 5 nodes, 14 links
Technical: 6 nodes, 9 edges
Graph: 6 nodes, 18 linksStep 7: Graph Analysis of Extracted Knowledge
if node_count(store) > 0
adj = SemanticSpacetime.AdjacencyMatrix()
for (nptr, node) in store.nodes
for links in node.incidence
for link in links
SemanticSpacetime.add_edge!(adj, nptr, link.dst, Float64(link.wgt))
end
end
end
if !isempty(adj.nodes)
println(graph_summary(adj))
centrality = eigenvector_centrality(adj; max_iter=100, tol=1e-6)
ranked = sort(collect(centrality), by=x -> x.second, rev=true)
println("\nMost central nodes in extracted knowledge:")
for (nptr, score) in ranked[1:min(6, length(ranked))]
node = mem_get_node(store, nptr)
if node !== nothing
label = node.s[1:min(50, length(node.s))]
println(" $(rpad(label, 52)) $(round(score, digits=4))")
end
end
end
endGraph Summary
─────────────
Nodes: 17
Links: 62 (directed)
Sources: 0
Sinks: 0
Top centrality:
(4,2) 1.0000
(4,9) 0.3268
(2,1) 0.3268
(4,10) 0.3268
(4,8) 0.3268
Most central nodes in extracted knowledge:
part 0 of climate science 1.0
temperature increases ranging from 1.5 to 4.5 degr 0.3268
Extreme weather 0.3268
events including heatwaves, droughts, and intense 0.3268
Climate models project continued warming under all 0.3268
more frequent and severe. 0.3268Summary
The text→knowledge pipeline:
| Step | Function | Purpose |
|---|---|---|
| 1 | text_to_n4l(text) | Auto-generate N4L from plain text |
| 2 | score_sentence(text, vocab) | Measure intentionality of individual sentences |
| 3 | extract_significant_sentences(text) | Select the most meaningful sentences |
| 4 | compile_n4l_string!(store, n4l) | Build a knowledge graph from generated N4L |
| 5 | mem_search_text(store, query) | Search the resulting graph |
| 6 | eigenvector_centrality(adj) | Analyse structural importance |
This pipeline transforms unstructured text into a queryable, analysable knowledge graph — automatically identifying and preserving the most significant information.