GitHub - gbif/content-crawler: Crawls CMS and articles from Mendeley into ElasticSearch indexes

Content crawl

This project contains crawler code to access the Mendeley API and Contentful API and index them in an Elastic Search index.

Mendeley Crawl

The Mendeley API requires an application key and secret which can be created by:

Create an account on http://dev.mendeley.com/
Clicking on the "My Apps" tab
Creating an application, putting in http://localhost/ignored in the Redirect URL
Generating the secret (note, you need to copy it at this point, you can't view it afterwards)
Copying the secret and application ID (e.g. 4108 is an application ID) into you properties file

The Mendeley API uses OAUTH2 which normally redirects the user to a page to confirm details. We used anonymous access as described in this stack overflow and the Apache Oltu library. This is why http://localhost/ignored was used as the application redirect (we don't use it).

To run this and index into ES, assuming ES is running on localhost and answering http://localhost:9200/:

  // edit configuration to set cluster_name to your clustername listed on http://localhost:9200/
  $ mvn package
  $ java -jar target/content-crawler.jar con-crawl --conf configuration.yml

a successful run will look like:

INFO  [2017-03-01 16:41:43,996+0100] [main] org.gbif.mendeley.crawl.MendeleyDocumentCrawler: Requesting data from https://api.mendeley.com/documents?group_id=dcb8ff61-dbc0-3519-af76-2072f22bc22f&limit=500
INFO  [2017-03-01 16:41:44,497+0100] [main] org.gbif.mendeley.crawl.MendeleyDocumentCrawler: Mendeley reports total results: 4330
INFO  [2017-03-01 16:41:44,989+0100] [main] org.gbif.mendeley.crawl.ElasticSearchIndexHandler: Indexed [500] documents
INFO  [2017-03-01 16:41:44,990+0100] [main] org.gbif.mendeley.crawl.MendeleyDocumentCrawler: Requesting data from https://api.mendeley.com/documents?group_id=dcb8ff61-dbc0-3519-af76-2072f22bc22f&marker=d0f140bf-5b19-37e9-b9d6-84865e438eed&limit=500&reverse=false&order=asc
INFO  [2017-03-01 16:41:45,627+0100] [main] org.gbif.mendeley.crawl.ElasticSearchIndexHandler: Indexed [500] documents
INFO  [2017-03-01 16:41:45,627+0100] [main] org.gbif.mendeley.crawl.MendeleyDocumentCrawler: Requesting data from https://api.mendeley.com/documents?group_id=dcb8ff61-dbc0-3519-af76-2072f22bc22f&marker=04c9ab27-26d3-33fb-9e8e-d001ca9b4021&limit=500&reverse=false&order=asc
INFO  [2017-03-01 16:41:46,219+0100] [main] org.gbif.mendeley.crawl.ElasticSearchIndexHandler: Indexed [500] documents
INFO  [2017-03-01 16:41:46,219+0100] [main] org.gbif.mendeley.crawl.MendeleyDocumentCrawler: Requesting data from https://api.mendeley.com/documents?group_id=dcb8ff61-dbc0-3519-af76-2072f22bc22f&marker=dab2d8a4-1840-387c-88e5-b5639c0caf99&limit=500&reverse=false&order=asc
INFO  [2017-03-01 16:41:46,686+0100] [main] org.gbif.mendeley.crawl.ElasticSearchIndexHandler: Indexed [500] documents
INFO  [2017-03-01 16:41:46,687+0100] [main] org.gbif.mendeley.crawl.MendeleyDocumentCrawler: Requesting data from https://api.mendeley.com/documents?group_id=dcb8ff61-dbc0-3519-af76-2072f22bc22f&marker=fa9ee2cb-635f-3adf-995d-a8f8354f73e9&limit=500&reverse=false&order=asc
INFO  [2017-03-01 16:41:47,123+0100] [main] org.gbif.mendeley.crawl.ElasticSearchIndexHandler: Indexed [500] documents
INFO  [2017-03-01 16:41:47,124+0100] [main] org.gbif.mendeley.crawl.MendeleyDocumentCrawler: Requesting data from https://api.mendeley.com/documents?group_id=dcb8ff61-dbc0-3519-af76-2072f22bc22f&marker=fd124e8d-219c-38eb-9d46-dee67964b555&limit=500&reverse=false&order=asc
INFO  [2017-03-01 16:41:47,832+0100] [main] org.gbif.mendeley.crawl.ElasticSearchIndexHandler: Indexed [500] documents
INFO  [2017-03-01 16:41:47,832+0100] [main] org.gbif.mendeley.crawl.MendeleyDocumentCrawler: Requesting data from https://api.mendeley.com/documents?group_id=dcb8ff61-dbc0-3519-af76-2072f22bc22f&marker=2cbd9481-fbeb-3f40-ae3b-bbb41b2aaa7c&limit=500&reverse=false&order=asc
INFO  [2017-03-01 16:41:48,454+0100] [main] org.gbif.mendeley.crawl.ElasticSearchIndexHandler: Indexed [500] documents
INFO  [2017-03-01 16:41:48,454+0100] [main] org.gbif.mendeley.crawl.MendeleyDocumentCrawler: Requesting data from https://api.mendeley.com/documents?group_id=dcb8ff61-dbc0-3519-af76-2072f22bc22f&marker=ac4b864f-c540-36cd-a0f3-dc38cb1efee5&limit=500&reverse=false&order=asc
INFO  [2017-03-01 16:41:49,039+0100] [main] org.gbif.mendeley.crawl.ElasticSearchIndexHandler: Indexed [500] documents
INFO  [2017-03-01 16:41:49,039+0100] [main] org.gbif.mendeley.crawl.MendeleyDocumentCrawler: Requesting data from https://api.mendeley.com/documents?group_id=dcb8ff61-dbc0-3519-af76-2072f22bc22f&marker=a7d72802-57b3-30e3-9596-a7512db65edd&limit=500&reverse=false&order=asc
INFO  [2017-03-01 16:41:49,597+0100] [main] org.gbif.mendeley.crawl.ElasticSearchIndexHandler: Indexed [330] documents

Which can be queried e.g. $ curl -XPOST "http://localhost:9200/mendeley4/_search?pretty=true" -d '{"query" : { "query_string" : {"query" : "*"} }}'

Contentful Crawl

The Contentful API requires an authentication token and space id than

Name		Name	Last commit message	Last commit date
Latest commit History 353 Commits
src/main		src/main
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Content crawl

Mendeley Crawl

Contentful Crawl

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 7

Uh oh!

Languages

License

gbif/content-crawler

Folders and files

Latest commit

History

Repository files navigation

Content crawl

Mendeley Crawl

Contentful Crawl

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 7

Uh oh!

Languages

Packages