Free software for journalists: Tutorials, bookmarks and open source tools for journalistic research, investigations and privacy, plus other digital tools for investigative journalism and data-driven journalism:
Independent media tools for journalists and investigative reporting
With free open source software, you can run research tools for sensitive documents or data on your own computer or server instead of relying on cloud services that may spy on you.
Tutorials and tips: How to use open source research tools for investigative journalism
- How to search, sort, explore and filter large document collections or many search results
- How to use boolean search operators
- Tagging and annotation for collaborative investigative journalism
- How to (fuzzy) search by a list: find which entries or names from a list occur in document sets
- How to structure data for exploratory search, aggregated overviews and interactive filters with facets and named entities
- How to integrate Open Data, e.g. from Wikidata, to enhance search and structure your document collections
- How to set up your own desktop search engine for many documents in a virtual machine
- How to set up a search engine on an encrypted USB key or external hard drive
- Tips, tools and How-tos for safer online communications: Surveillance self-defense
- Security in a box
- Encryption works
- Information Security for Journalists
- How to build interactive maps with CartoDB
- Understanding language data: Open-source NLP software can help
- How to scrape structured data from websites with Python and Scrapy
Toolbox: Free software, open source tools and resources
Free software and open source discovery and research tools for journalists:
Search engines for fulltext search and discovery
Research methods, techniques and technology: Fulltext search, Information retrieval, Desktop Search, Enterprise Search and faceted search
Tutorials:
- How to search, sort, explore and filter large document collections or many search results
- How to use boolean search operators
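Boolean operators can also be combined programmatically before a query is sent to a search engine. The following is a minimal sketch, assuming Apache Lucene style query syntax (AND, OR, NOT, quoted phrases) as used by Solr and Elastic Search. The helper names `phrase`, `all_of`, `any_of` and `exclude` are hypothetical and not part of any library, and the search terms are invented.

```python
def phrase(text):
    """Quote a multi-word term so it is searched as an exact phrase."""
    escaped = text.replace('"', '\\"')
    return f'"{escaped}"'


def all_of(*terms):
    """Every term must match (boolean AND)."""
    return "(" + " AND ".join(terms) + ")"


def any_of(*terms):
    """At least one term must match (boolean OR)."""
    return "(" + " OR ".join(terms) + ")"


def exclude(query, *terms):
    """Drop results that contain any of the given terms (boolean NOT)."""
    return query + "".join(f" NOT {term}" for term in terms)


# Documents mentioning bribery or corruption together with the exact
# phrase "offshore account", but not sports coverage.
query = exclude(
    all_of(any_of("bribery", "corruption"), phrase("offshore account")),
    "sports",
)
print(query)
```

Building queries from small functions like these keeps the grouping with parentheses explicit, which is where hand-written boolean queries most often go wrong.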
Open source search tools:
- Open Semantic Search: Own semantic search engine
- Open Semantic Desktop Search: Own search engine for single desktop users and laptops
- InvestigateIX: Secure search engine on encrypted external devices
- Recoll: Desktop search for Linux
- Fuzzy search with lists
- FESS: Enterprise Search engine with user interfaces for search and crawling of files and websites
- Kibana for Elastic Search: Search and data visualization of an Elastic Search index
- Banana for Solr: Search and data visualization of a Solr index
- HUE Solr search: Search and data visualization of a Solr index
- Open Semantic ETL for Solr or Open Semantic ETL for Elastic Search: Tools to import files and documents of different file formats to a search index
- Apache ManifoldCF: Tools to import files to an Elastic Search or Solr search index
Search libraries and APIs
If you want to write code yourself, you can use these powerful engines as a base:
- Solr: Index and search API
- Elastic Search: Index and search API
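As an illustration of what coding against such a search API can look like, here is a minimal sketch that queries Solr over HTTP through its `/select` request handler with the `q`, `rows` and `wt` parameters. The server address `http://localhost:8983/solr` and the core name `documents` are assumptions for this example; adjust them to your own installation. Only the URL building runs without a server.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_select_url(base_url, core, query, rows=10):
    """Build the URL for a Solr /select request that returns JSON."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url.rstrip('/')}/{core}/select?{params}"


def search(base_url, core, query, rows=10):
    """Run the query and return the list of matching documents.

    Needs a running Solr server; not called in this sketch.
    """
    with urlopen(build_select_url(base_url, core, query, rows)) as response:
        return json.load(response)["response"]["docs"]


# Assumed server address and core name (hypothetical).
url = build_select_url("http://localhost:8983/solr", "documents", "bribery AND contract")
print(url)
```

With a server running, `search("http://localhost:8983/solr", "documents", "bribery AND contract")` would return the matching documents as Python dictionaries.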
Databases, digital archives, data management systems, document management systems and content management systems
Methods: Archive, database, forms, categories (tagging), classification, meta data, repository, document management (DMS), content management (CMS) or enterprise content management (ECM), knowledge management, knowledge base, bookmarks
- Zotero: Bookmark database and citations manager with tagging and annotation features
- Docear: Bookmark database and citations manager with mindmap, tagging and annotation features
- LibreOffice Calc: Open source spreadsheet program
- Document Cloud: Document management system for paper-based documents like scans or PDF
- Semantic Mediawiki: Extends Mediawiki to a semantic database
- Drupal CMS: The CMS's fields module provides an easy-to-use UI to create your own content types, data fields and forms
- Agorum: Automated extraction of structured amounts of money from bills
Tagging and annotation
Methods: Annotation, Tagging, Social Tagging, Folksonomies
Tutorial: Tagging and annotation for collaborative investigative journalism
- Zotero: Bookmark database and citations manager with tagging and annotation features
- Docear: Bookmark database and citations manager with mindmap, tagging and annotation features
- Document Cloud: Tagging and annotation for paper-based documents like scans or PDF documents
- Neonion: Collaborative annotations within text
- Pundit: Annotations within text and within images
- Hypothesis
- Annotator.js
Text mining, text analysis and document mining
Method: Text mining, Natural Language Processing (NLP), Named entities extraction
- Text mining tutorial: How to analyze large document collections: Text mining with the search engine Open Semantic Search
- Understanding language data: Open-source NLP software can help
- Overview project: Showing most used words and trees of most used words
- Jigsaw: Text mining tool (not open source, but free download)
More:
- Wikipedia list of open source text mining software
- Tapor: Text Analysis Portal for Research
Reconciliation and merging
Methods: Compare, merge, reconcile, link, clustering
- Fuzzy search with lists: Checks whether there are search results for each list entry
- OpenRefine
- DocDiff: Shows and visualizes the differences between two versions of a text
- FSlint: Compares two directories and finds identical files that exist in both
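The idea behind fuzzy matching a list of names against a document set can be sketched in a few lines. This example uses `difflib.SequenceMatcher` from the Python standard library as a stand-in; it is not how the tools listed above are implemented. The names, documents and the similarity threshold of 0.85 are invented for illustration.

```python
from difflib import SequenceMatcher


def best_similarity(name, text):
    """Highest similarity between `name` and any run of words of the same length in `text`."""
    words = text.lower().split()
    size = len(name.split())
    target = name.lower()
    best = 0.0
    for start in range(max(len(words) - size + 1, 0)):
        window = " ".join(words[start:start + size])
        best = max(best, SequenceMatcher(None, target, window).ratio())
    return best


def match_list(names, documents, threshold=0.85):
    """For each name, list the documents containing a close enough spelling."""
    return {
        name: [
            doc_id
            for doc_id, text in documents.items()
            if best_similarity(name, text) >= threshold
        ]
        for name in names
    }


# Invented sample data: both documents contain a misspelled name.
names = ["John Smith", "Acme Holdings"]
documents = {
    "memo.txt": "Payment approved by Jon Smith on Friday",
    "invoice.txt": "Invoice issued to Acme Holdngs Ltd",
    "notes.txt": "Nothing relevant here",
}
hits = match_list(names, documents)
print(hits)
```

A lower threshold finds more spelling variants but also more false positives, so every hit should still be checked by hand.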
Graphs and social network analysis (SNA)
Tools to analyze and visualize connections and relations:
- Network analysis tutorial: How to visualize connections & relations in documents with Open Semantic Search
- Gephi: Desktop tool for analysis and data visualization of networks, connections and graphs
- Cytoscape.js: JavaScript library for data visualization of networks, connections and graphs
- Semantic Mediawiki: Very flexible CMS for linked data
- Detective: Python/Django and neo4j graph database based CMS for connections
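To analyze connections with a tool such as Gephi, the relations first have to exist as a list of edges. The sketch below counts how often two entities appear in the same document and writes the result as a CSV edge list with `Source`, `Target` and `Weight` columns. It assumes the entities per document have already been extracted; the document names and entities are invented.

```python
import csv
import io
from collections import Counter
from itertools import combinations


def cooccurrence_edges(entities_by_document):
    """Count how many documents mention each pair of entities together."""
    edges = Counter()
    for entities in entities_by_document.values():
        # Sort so that (A, B) and (B, A) are counted as the same edge.
        for source, target in combinations(sorted(set(entities)), 2):
            edges[(source, target)] += 1
    return edges


def edges_to_csv(edges):
    """Serialize the edges as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in sorted(edges.items()):
        writer.writerow([source, target, weight])
    return buffer.getvalue()


# Invented sample data.
entities_by_document = {
    "contract.pdf": ["Acme Holdings", "John Smith"],
    "email.txt": ["John Smith", "Acme Holdings", "Jane Doe"],
}
edges = cooccurrence_edges(entities_by_document)
csv_text = edges_to_csv(edges)
print(csv_text)
```

The weight shows in how many documents a pair appears together, which a graph tool can map to edge thickness.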
Privacy, security, safety and encryption
Digital security: Protect your research, sources and whistleblowers with privacy tools and encryption tools:
Methods: Encryption (PGP, OTR) and anonymization
Tutorials:
- Surveillance self-defense: Tips, Tools and How-tos for Safer Online Communications
- Security in a box
- Encryption works
- How to set up a search engine on an encrypted USB key or external hard drive
- Information Security for Journalists
Open source tools:
- Tails (The Amnesic Incognito Live System): Linux-based operating system for encryption and anonymous access to the internet
- TrueCrypt: Hard disk encryption for Windows
- GnuPG: OpenPGP-based email encryption
- Enigmail: Encryption plugin for the Thunderbird email client
- Tor project: Anonymity online
- OTR: Encryption for chats and instant messaging
- TextSecure: Encrypted messenger (like WhatsApp, but designed for privacy)
- Jitsi: Encrypted communicator (like Skype, but open source and with safer end-to-end encryption)
- RedPhone: Encrypted voice over IP communicator for smartphones
- SecureDrop: Upload platform for whistleblowers
- GlobaLeaks: Another upload platform for whistleblowing
Media monitoring, news filtering, news pipes and alerts
Open source software for media monitoring, news processing, news filtering and alerting:
- Open Semantic Search rules for news pipes and alerts: Filters and alerts for news from different news sources and data sources. Has a very powerful filter and search query language (Apache Lucene based), e.g. supporting fuzzy search. Supports many file formats and data sources because you can use all standard connectors for Solr.
- Mozilla Thunderbird: Desktop software for reading, filtering and auto-tagging RSS feeds
- Streamtools: Visual news pipes for stream processing from the New York Times Lab
- Huginn: Ruby on Rails and SQL based agents
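The basic principle of a news filter is small enough to sketch with the Python standard library: parse a feed, then keep only the items that mention a watched keyword. This is plain keyword matching and far simpler than the query languages of the tools listed above. The feed content and the keywords are invented, and the function name `matching_items` is hypothetical.

```python
import xml.etree.ElementTree as ET


def matching_items(rss_xml, keywords):
    """Return the titles of RSS items whose title or description mentions a keyword."""
    root = ET.fromstring(rss_xml)
    lowered = [keyword.lower() for keyword in keywords]
    matches = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        description = item.findtext("description", default="")
        text = f"{title} {description}".lower()
        if any(keyword in text for keyword in lowered):
            matches.append(title)
    return matches


# Invented sample feed; a real feed would be downloaded first.
sample_feed = """
<rss version="2.0">
  <channel>
    <title>Example news</title>
    <item>
      <title>City council approves budget</title>
      <description>The vote followed a long debate.</description>
    </item>
    <item>
      <title>Minister resigns</title>
      <description>The resignation follows bribery allegations.</description>
    </item>
  </channel>
</rss>
""".strip()

alerts = matching_items(sample_feed, ["bribery", "offshore"])
print(alerts)
```

An alert would then be sent for each returned title, for example by email.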
Extract data or convert data
Methods: Data integration, extraction, data converter, data migration, ETL (Extract, Transform, Load), Scraping
Extract text or structured data from documents
- Documents: Tika content analysis toolkit: Extract text and meta data from documents of many different file formats
- CSV tables: CSV Manager: Import big CSV spreadsheets to Solr based search engines
- PDF tables: Tabula: Extracts spreadsheets from PDF documents
- Scans and images: Optical character recognition (OCR)
Extract text from images (OCR)
- Tesseract: OCR Software to recognize text from images
- Scan Tailor: Deskewing low quality scans
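OCR can be automated for many scans by calling the Tesseract command line program from a script. The following is a minimal sketch, assuming `tesseract` is installed and on the PATH and using its basic invocation `tesseract imagename outputbase -l lang`. The file name `scan.png` and the helper names are hypothetical; the OCR step only runs if both the program and the file are present.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_command(image_path, output_base, language="eng"):
    """Build the command line; Tesseract writes its result to <output_base>.txt."""
    return ["tesseract", str(image_path), str(output_base), "-l", language]


def ocr_image(image_path, language="eng"):
    """Run Tesseract on one image and return the recognized text."""
    image_path = Path(image_path)
    output_base = image_path.with_suffix("")  # e.g. scan.png -> scan
    subprocess.run(tesseract_command(image_path, output_base, language), check=True)
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")


# Hypothetical input file with German text ("deu" language data must be installed).
command = tesseract_command("scan.png", "scan", language="deu")
print(command)

if shutil.which("tesseract") and Path("scan.png").exists():
    print(ocr_image("scan.png", language="deu"))
```

Looping `ocr_image` over a directory of scans turns a pile of images into text files that a search engine can index.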
Extract text from sound files (speech recognition)
- CMU Sphinx: Open source speech recognition toolkit
Extract structured data from websites (Scraping)
- Portia: Extract structured data from websites by a visual user interface
- Scrapy: Extract structured data from websites by Python scrapers
Extract, transform, load (ETL) frameworks for importing and transforming or converting data
- Transform to plain text: Tika content analysis toolkit
- Apache NiFi: Extract, transform, load and distribute data
- Talend Open Studio: Import and transform data to other formats
- Kettle: Import and transform data to other formats
- Logstash: Import and transform data from data sources like log files into a structured search index
Data visualization
Method: data visualization
Tools for data visualization:
- Kibana for Elastic Search: User interface for search, interactive filtering and data visualization
- D3.js (Data-Driven Documents): Data visualization library for JavaScript programmers
- CartoDB: Open source web application and mapping tool for data visualization of spatial data
- Apache Zeppelin: Interactive data analysis and data visualization platform
- TimelineJS: Creating timelines
- Cytoscape.js: JavaScript library for data visualization of networks, connections and graphs
- Semantic result formats: Data visualizations for data from a Semantic Mediawiki
Charts and diagrams
- Datawrapper: Web app and user interface for easily generating charts
- HUE Solr search
- Kibana for Elastic Search
- Apache Zeppelin
- Superset
- Banana for Solr
- NVD3: JavaScript library for easy programming of charts with D3
Maps and mapping (spatial data)
Create interactive maps and visualize spatial data (geodata) with open source software for mapping:
- CartoDB: Open source web application and mapping tool for interactive maps
- QGIS: Open source desktop tool for maps
- Leaflet: JavaScript library for interactive maps
- OpenLayers: Powerful JavaScript library for maps
- OpenStreetMap: Open source and open data for maps
- GeoParsePy: Open source for geo parsing to extract geodata for mapping like places and locations from text
- Serving tiles: How to run your own map server with open source software
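Before places can be shown on a map, they usually need to be in a format the mapping tool can read. The sketch below converts a list of places into a GeoJSON FeatureCollection, a standard format that tools such as Leaflet, OpenLayers and QGIS can load. The place data and the function name `to_geojson` are invented for illustration. Note that GeoJSON stores coordinates as longitude first, then latitude.

```python
import json


def to_geojson(places):
    """Convert dictionaries with name, lat and lon into a GeoJSON FeatureCollection."""
    features = []
    for place in places:
        features.append({
            "type": "Feature",
            "geometry": {
                "type": "Point",
                # GeoJSON order is [longitude, latitude].
                "coordinates": [place["lon"], place["lat"]],
            },
            "properties": {"name": place["name"]},
        })
    return {"type": "FeatureCollection", "features": features}


# Invented sample data.
places = [
    {"name": "Berlin", "lat": 52.52, "lon": 13.405},
    {"name": "Hamburg", "lat": 53.5511, "lon": 9.9937},
]
geojson = to_geojson(places)
print(json.dumps(geojson, indent=2))
```

Saving the printed output to a `.geojson` file gives a layer that can be added to a map.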
Visualize events on a timeline
Create timelines with open source timeline tools and visualize events on interactive multimedia timelines:
- Tutorial on timelines
- TimelineJS
- Simile Timeline
- Odyssey.js: Combines a timeline with a map, for timelines of spatial data
Graphs, networks, connections and relations
- Network analysis tutorial: How to visualize connections & relations in documents with open semantic search
- Gephi: Desktop tool for analysis and data visualization of networks, connections and graphs
- Cytoscape.js: JavaScript library for data visualization of networks, connections and graphs
- Sigma js: JavaScript library for data visualization of networks, connections and graphs
Redact documents and delete meta data
Clean sensitive documents and delete meta data that is stored invisibly inside document files or photos, such as user names or the serial numbers of hardware (e.g. of your camera) or software:
- PDF Redact Tools: Most secure way to delete meta data from PDFs
- MAT: Metadata Anonymisation Toolkit: User interface to delete meta data from different document formats and image formats
Statistics and analytics
Method: Data analysis, statistics, chart, diagram, data visualization
- LibreOffice Calc: Open source spreadsheet program
- HUE Solr search
- Kibana for Elastic Search
- Statistical software: Specialized computer programs for statistical analysis and econometric analysis
- Business Intelligence: Tools for statistics and analytics
- Programming with R or Python or another programming language
- Mining of massive datasets: Book (free PDF download) explaining data mining methods
Universal open source toolset
The ultimate universal open source toolset is a Linux distribution like Debian GNU/Linux or Ubuntu Linux, which comes with thousands of packages of free software and open source tools, software libraries and programming languages.
You don't have to remove your existing operating system: With open source virtualization software like VirtualBox for Windows or Mac, you can run a Linux distribution in a window inside your existing operating system environment.
You may want to start with Linux inside your existing system environment using the preconfigured Debian-based virtual machine (VM) Open Semantic Desktop Search, which provides a preselected and preconfigured collection of tools for investigative journalists.